From bdf71c97ea0a25a1fe85d5f64049aaae5e54f13e Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Lucas=20Rod=C3=A9s-Guirao?= Date: Mon, 2 Dec 2024 23:23:38 +0100 Subject: [PATCH] =?UTF-8?q?=F0=9F=93=8A=20hmd=20update=20(#3642)?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit * 📊 hmd update * snapshot * wip * dag * archive old hmd * wip * wip * ci/cd * wip * wip * wip * wip * wip * change column names * wip * wip * wip * propagate snapshot metadata * wip * wip * garden * missing dimensions * improve debug message * wip * wip * fix dimension * wip * wip * wip * nit memory opt * wip * fix origins propagation * grapher * dataset title * minor metadata update --- dag/archive/demography.yml | 14 +- dag/demography.yml | 20 +- dag/main.yml | 5 - .../2024-12-02/survivor_percentiles.meta.yml | 44 ++ .../2024-12-02/survivor_percentiles.py | 137 +++++++ .../garden/hmd/2024-11-27/hmd.countries.json | 52 +++ .../data/garden/hmd/2024-11-27/hmd.meta.yml | 387 ++++++++++++++++++ etl/steps/data/garden/hmd/2024-11-27/hmd.py | 228 +++++++++++ .../2024-12-02/survivor_percentiles.py | 35 ++ etl/steps/data/grapher/hmd/2024-11-27/hmd.py | 82 ++++ etl/steps/data/meadow/hmd/2024-11-27/hmd.py | 325 +++++++++++++++ snapshots/hmd/2024-11-27/hmd.py | 25 ++ snapshots/hmd/2024-11-27/hmd.zip.dvc | 74 ++++ 13 files changed, 1416 insertions(+), 12 deletions(-) create mode 100644 etl/steps/data/garden/demography/2024-12-02/survivor_percentiles.meta.yml create mode 100644 etl/steps/data/garden/demography/2024-12-02/survivor_percentiles.py create mode 100644 etl/steps/data/garden/hmd/2024-11-27/hmd.countries.json create mode 100644 etl/steps/data/garden/hmd/2024-11-27/hmd.meta.yml create mode 100644 etl/steps/data/garden/hmd/2024-11-27/hmd.py create mode 100644 etl/steps/data/grapher/demography/2024-12-02/survivor_percentiles.py create mode 100644 etl/steps/data/grapher/hmd/2024-11-27/hmd.py create mode 100644 etl/steps/data/meadow/hmd/2024-11-27/hmd.py create mode 100644 snapshots/hmd/2024-11-27/hmd.py create mode 100644 snapshots/hmd/2024-11-27/hmd.zip.dvc diff --git a/dag/archive/demography.yml b/dag/archive/demography.yml index 185500becad..3879d5f3c6f 100644 --- a/dag/archive/demography.yml +++ b/dag/archive/demography.yml @@ -52,9 +52,21 @@ steps: data-private://grapher/un/2024-07-11/un_wpp_full: - data-private://garden/un/2024-07-11/un_wpp - # Population density + # Population density data://garden/demography/2023-06-12/population_density: - data://garden/demography/2023-03-31/population - data://garden/faostat/2024-03-14/faostat_rl data://grapher/demography/2023-06-12/population_density: - data://garden/demography/2023-06-12/population_density + + # HMD + data://meadow/hmd/2022-12-07/life_tables: + - snapshot://hmd/2022-12-07/hmd.zip + data://garden/hmd/2022-12-07/life_tables: + - data://meadow/hmd/2022-12-07/life_tables + + # Survivorship ages (HMD-derived) + data://garden/demography/2023-09-27/survivor_percentiles: + - data://garden/hmd/2023-09-19/hmd + data://grapher/demography/2023-09-27/survivor_percentiles: + - data://garden/demography/2023-09-27/survivor_percentiles diff --git a/dag/demography.yml b/dag/demography.yml index 7b68f6214de..cdd5e679957 100644 --- a/dag/demography.yml +++ b/dag/demography.yml @@ -123,18 +123,20 @@ steps: data://grapher/hmd/2023-09-19/hmd: - data://garden/hmd/2023-09-19/hmd + # Human Mortality Database + data://meadow/hmd/2024-11-27/hmd: + - snapshot://hmd/2024-11-27/hmd.zip + data://garden/hmd/2024-11-27/hmd: + - data://meadow/hmd/2024-11-27/hmd + 
data://grapher/hmd/2024-11-27/hmd: + - data://garden/hmd/2024-11-27/hmd + # Gini Life Expectancy Inequality data://garden/demography/2023-10-04/gini_le: - data://garden/demography/2023-10-03/life_tables data://grapher/demography/2023-10-04/gini_le: - data://garden/demography/2023-10-04/gini_le - # Survivorship ages (HMD-derived) - data://garden/demography/2023-09-27/survivor_percentiles: - - data://garden/hmd/2023-09-19/hmd - data://grapher/demography/2023-09-27/survivor_percentiles: - - data://garden/demography/2023-09-27/survivor_percentiles - # Phi-gender life expectancy inequality data://garden/demography/2023-10-03/phi_gender_le: - data://garden/demography/2023-10-03/life_tables @@ -245,3 +247,9 @@ steps: - data://meadow/demography/2024-11-26/multiple_births data://grapher/demography/2024-11-26/multiple_births: - data://garden/demography/2024-11-26/multiple_births + + # Survivorship ages (HMD-derived) + data://garden/demography/2024-12-02/survivor_percentiles: + - data://garden/hmd/2024-11-27/hmd + data://grapher/demography/2024-12-02/survivor_percentiles: + - data://garden/demography/2024-12-02/survivor_percentiles diff --git a/dag/main.yml b/dag/main.yml index f134f1d6793..147f2c28f01 100644 --- a/dag/main.yml +++ b/dag/main.yml @@ -130,11 +130,6 @@ steps: - data://garden/regions/2023-01-01/regions data://grapher/technology/2022/internet: - data://garden/technology/2022/internet - # HMD - data://meadow/hmd/2022-12-07/life_tables: - - snapshot://hmd/2022-12-07/hmd.zip - data://garden/hmd/2022-12-07/life_tables: - - data://meadow/hmd/2022-12-07/life_tables # UNDP data://meadow/un/2024-04-09/undp_hdr: diff --git a/etl/steps/data/garden/demography/2024-12-02/survivor_percentiles.meta.yml b/etl/steps/data/garden/demography/2024-12-02/survivor_percentiles.meta.yml new file mode 100644 index 00000000000..07e19bde3f9 --- /dev/null +++ b/etl/steps/data/garden/demography/2024-12-02/survivor_percentiles.meta.yml @@ -0,0 +1,44 @@ +# NOTE: To learn more about the fields, hover over their names. +definitions: + common: + presentation: + topic_tags: + - Life Expectancy + +# Learn more about the available fields: +# http://docs.owid.io/projects/etl/en/latest/architecture/metadata/reference/dataset/ +dataset: + title: Survivorship percentiles (HMD; Alvarez and Vaupel 2023) + update_period_days: 365 + +# Learn more about the available fields: +# http://docs.owid.io/projects/etl/en/latest/architecture/metadata/reference/tables/ +tables: + survivor_percentiles: + variables: + age: + title: Survivorship age + unit: years + processing_level: major + description_short: |- + <%- if percentile == 1 -%> + The age until which the 1st percentile (99% of the population) of the population would survive until, if they experienced the same age-specific death rates throughout their whole lives as the age-specific death rates seen in that particular year. + <%- else -%> + The age until which the << percentile>>th percentile (<< 100 - percentile|int>>% of the population) of the population would survive until, if they experienced the same age-specific death rates throughout their whole lives as the age-specific death rates seen in that particular year. + <%- endif -%> + + description_processing: |- + This was calculated with the method published in Alvarez and Vaupel (2023), with code provided by the authors: + + Jesús-Adrián Alvarez, James W. Vaupel; Mortality as a Function of Survival. Demography 1 February 2023; 60 (1): 327–342. 
doi: https://doi.org/10.1215/00703370-10429097 + + These estimates were regenerated for data from more recent years in the Human Mortality Database. + + Original R code from: https://github.com/jssalvrz/s-ages + description_key: + - This is calculated with the period life tables indicators. + display: + numDecimalPlaces: 1 + presentation: + attribution: |- + Alvarez & Vaupel (2023); Human Mortality Database (2024) diff --git a/etl/steps/data/garden/demography/2024-12-02/survivor_percentiles.py b/etl/steps/data/garden/demography/2024-12-02/survivor_percentiles.py new file mode 100644 index 00000000000..1f2b1ef59cc --- /dev/null +++ b/etl/steps/data/garden/demography/2024-12-02/survivor_percentiles.py @@ -0,0 +1,137 @@ +"""Load a meadow dataset and create a garden dataset. + +Methods used here are taken from https://github.com/jssalvrz/s-ages. Authors of Citation: Alvarez, J.-A., & Vaupel, J. W. (2023). Mortality as a Function of Survival. Demography, 60(1), 327–342. https://doi.org/10.1215/00703370-10429097 + + +Dr. Saloni Dattani translated the R scripts into Python: + - Original: https://github.com/jssalvrz/s-ages + - Translated: https://github.com/saloni-nd/misc/tree/main/survivorship-ages + +Lucas Rodes-Guirao adapted the python code for ETL. +""" + +import numpy as np +import pandas as pd +from owid.catalog import Table +from scipy.integrate import cumulative_trapezoid as cumtrapz +from scipy.interpolate import InterpolatedUnivariateSpline + +from etl.helpers import PathFinder, create_dataset + +# Get paths and naming conventions for current step. +paths = PathFinder(__file__) + + +def run(dest_dir: str) -> None: + # + # Load inputs. + # + paths.log.info("load data.") + # Load meadow dataset. + ds_meadow = paths.load_dataset("hmd") + + # Read table from meadow dataset. + tb_deaths = ds_meadow.read("deaths") + tb_exposure = ds_meadow.read("exposures") + + # + # Process data. + # + # Combine tables, drop NaNs + tb = tb_deaths.merge(tb_exposure, on=["country", "year", "sex", "age"], how="outer") + tb = tb.dropna(subset=["deaths", "exposure"], how="any") + + # Keep format="1x1", and sex="both" + paths.log.info("keep period & 1-year data.") + tb = tb.loc[tb["age"].str.match(r"^(\d{1,3}|d{3}\+)$") & (tb["type"] == "period")] + + # Drop unused columns + tb = tb.drop(columns=["type"]) + + # 110+ -> 110 + paths.log.info("replace 110+ -> 100, set Dtypes.") + tb["age"] = tb["age"].replace({"110+": "110"}).astype(int) + + # Sort + tb = tb.sort_values(["year", "age"]) + + # Actual calculation + paths.log.info("calculate surviorship ages (can take some minutes)...") + columns_grouping = ["country", "sex", "year"] + tb = tb.groupby(columns_grouping).apply(lambda group: obtain_survivorship_ages(group)).reset_index() # type: ignore + + # Unpivot + paths.log.info("reshape table") + tb = tb.melt( + id_vars=["country", "sex", "year"], + value_vars=["s1", "s10", "s20", "s30", "s40", "s50", "s60", "s70", "s80", "s90", "s99"], + var_name="percentile", + value_name="age", + ) + tb = tb.dropna(subset=["percentile"]) + tb["percentile"] = tb["percentile"].str.replace("s", "").astype(int) + tb["percentile"] = 100 - tb["percentile"] + + # Propagate metadata + tb["age"].metadata.origins = tb_exposure["exposure"].m.origins.copy() + + # Set index + paths.log.info("format") + tb = tb.format(["country", "year", "sex", "percentile"], short_name="survivor_percentiles") + + # + # Save outputs. + # + # Create a new garden dataset with the same metadata as the meadow dataset. 
+ ds_garden = create_dataset( + dest_dir, tables=[tb], check_variables_metadata=True, default_metadata=ds_meadow.metadata + ) + + # Save changes in the new garden dataset. + ds_garden.save() + + +def obtain_survivorship_ages(tb_group: Table, start_age: int = 0, end_age: int = 110) -> pd.DataFrame: + """Get survivorship ages given a life and deaths table. + + Output dataframe has a column for each percentile of survivorship age. + + tb_group is expected to be a subset of the compelte table. It should only concern a particular (country, year, sex) triple. + """ + # Step 1: Apply splines, get Mx for each (country, year, sex, age) + ## Define splines + ### We could use CubicSpline (k=3 order), but it provides slightly different results hence, for precaution, we sticked to InterpolatedUnivariateSpline. + ### This is equivalent to R function interpSpline + spline_deaths = InterpolatedUnivariateSpline(tb_group["age"], tb_group["deaths"], k=3) + spline_exposures = InterpolatedUnivariateSpline(tb_group["age"], tb_group["exposure"], k=3) + + ## Define age range (with step 0.01) + age_range = np.arange(start_age, end_age, 0.01) + + # Run splines over age range + deaths_spline = np.abs(spline_deaths(age_range)) + exposure_spline = np.abs(spline_exposures(age_range)) + exposure_spline[exposure_spline == 0] = np.nan + survival_age_spline = np.abs(deaths_spline / exposure_spline) + + # Step 2: Calculate survival, density, hazard, and cumulative hazards + ## Estimate parameters + Hx = cumtrapz(y=survival_age_spline, x=age_range, initial=0) # Hazard CDF + Sx = np.exp(-Hx) # Survivor function + + # Step 3: Calculate survivorship ages from parameters + out = {} + out["s0"] = max(age_range) + ## I'm using a for loop to simplify the logic here + for i in range(1, 101): + try: + sx_rounded = np.ceil((100 * Sx).round(3)) + value = age_range[sx_rounded == i][0] + out[f"s{i}"] = value + except IndexError: + out[f"s{i}"] = np.nan + + # Create output dataframe + df = pd.DataFrame(out, index=[0]) + + return df diff --git a/etl/steps/data/garden/hmd/2024-11-27/hmd.countries.json b/etl/steps/data/garden/hmd/2024-11-27/hmd.countries.json new file mode 100644 index 00000000000..c5fb3b64be0 --- /dev/null +++ b/etl/steps/data/garden/hmd/2024-11-27/hmd.countries.json @@ -0,0 +1,52 @@ +{ + "Australia": "Australia", + "Austria": "Austria", + "Belarus": "Belarus", + "Belgium": "Belgium", + "Bulgaria": "Bulgaria", + "Canada": "Canada", + "Chile": "Chile", + "Croatia": "Croatia", + "Czechia": "Czechia", + "Denmark": "Denmark", + "East Germany": "East Germany", + "Estonia": "Estonia", + "Finland": "Finland", + "Germany": "Germany", + "Greece": "Greece", + "Hong Kong": "Hong Kong", + "Hungary": "Hungary", + "Iceland": "Iceland", + "Ireland": "Ireland", + "Japan": "Japan", + "Latvia": "Latvia", + "Lithuania": "Lithuania", + "Luxembourg": "Luxembourg", + "Netherlands": "Netherlands", + "New Zealand": "New Zealand", + "Norway": "Norway", + "Poland": "Poland", + "Portugal": "Portugal", + "Republic of Korea": "South Korea", + "Russia": "Russia", + "Slovenia": "Slovenia", + "Spain": "Spain", + "Sweden": "Sweden", + "Switzerland": "Switzerland", + "Taiwan": "Taiwan", + "Ukraine": "Ukraine", + "United Kingdom": "United Kingdom", + "West Germany": "West Germany", + "England and Wales, Civilian National Population": "England and Wales (Civilians)", + "England and Wales, Total Population": "England and Wales", + "France, Civilian Population": "France (Civilians)", + "France, Total Population": "France", + "Israel, Total Population": 
"Israel", + "Italy ": "Italy", + "New Zealand -- Maori": "New Zealand (Maori)", + "New Zealand -- Non-Maori": "New Zealand (Non-Maori)", + "Northern Ireland": "Northern Ireland", + "Scotland": "Scotland", + "Slovakia ": "Slovakia", + "The United States of America": "United States" +} \ No newline at end of file diff --git a/etl/steps/data/garden/hmd/2024-11-27/hmd.meta.yml b/etl/steps/data/garden/hmd/2024-11-27/hmd.meta.yml new file mode 100644 index 00000000000..0cd4ff29340 --- /dev/null +++ b/etl/steps/data/garden/hmd/2024-11-27/hmd.meta.yml @@ -0,0 +1,387 @@ +# NOTE: To learn more about the fields, hover over their names. +definitions: + common: + presentation: + attribution_short: HMD + topic_tags: + - Life Expectancy + + others: + display_name_dim: |- + at << 'birth' if (age == '0') else age >><< ', ' + sex + 's' if (sex != 'total') >>, << type >> + title_public_dim: |- + at << age if age != '0' else 'birth'>> + global: + life_expectancy: + point_1: |- + <%- if type == "period" -%> + Period life expectancy is a metric that summarizes death rates across all age groups in one particular year. + <%- else -%> + Cohort life expectancy is the average lifespan of a group of people, usually a birth cohort – people born in the same year. + <%- endif -%> + point_2: |- + <%- if type == "period" -%> + <%- if age == '0' -%> + For a given year, it represents the average lifespan for a hypothetical group of people, if they experienced the same age-specific death rates throughout their whole lives as the age-specific death rates seen in that particular year. + <%- else -%> + For a given year, it represents the remaining average lifespan for a hypothetical group of people, if they experienced the same age-specific death rates throughout the rest of their lives as the age-specific death rates seen in that particular year. + <%- endif -%> + <%- else -%> + <%- if age == '0' -%> + It is calculated by tracking individuals from that cohort throughout their lives until death, and calculating their average lifespan. + <%- else -%> + It is calculated by tracking individuals from that cohort throughout the rest of their lives until death, and calculating their average remaining lifespan. + <%- endif -%> + <%- endif -%> + +# Learn more about the available fields: +# http://docs.owid.io/projects/etl/architecture/metadata/reference/dataset/ +dataset: + update_period_days: 365 + description: |- + The Human Mortality Database (HMD) is a collaborative project sponsored by the University of California, Berkeley (in the United States of America) and the Max Planck Institute for Demographic Research (in Germany). + + It provides researchers with comprehensive data on mortality from around 40 countries around the world, which have very high coverage and quality of data at the national level, through vital registration and potentially census data. + + Data is given in terms of period or cohort estimates: + + - **Period data** refers to a snapshot estimated with data at a particular interval. For period life expectancy at birth, this refers to the estimated life expectancy at birth based on a synthetic cohort created using mortality rates across age groups in a given year. + - **Cohort data** refers to estimates of a particular birth cohort. For cohort life expectancy at birth, this refers to the average number of years that people in the birth cohort survived. Cohort data may use birth cohorts that are ‘almost extinct’ rather than entirely extinct. 
+ + 'Interval' refers to the specific age- and time- period of the estimate. An interval can be a one year period for a single-age group, or it can be wider. For example, the life expectancy of a 40 year old in 2019 corresponds to an interval of 1 single-age group in 1 year. The central death rate of 5–9 year olds in 2020 corresponds to an interval of a 5 year age group in 1 year. + +# Learn more about the available fields: +# http://docs.owid.io/projects/etl/architecture/metadata/reference/tables/ +tables: + life_tables: + common: + presentation: + title_variant: << sex + 's, ' if sex != 'total' >><< type + ' tables'>> + topic_tags: + - Life Expectancy + + variables: + central_death_rate: + title: Central death rate + description_short: |- + The death rate, calculated as the number of deaths divided by the average number of people alive during the interval. + description_key: + - "The death rate is measured using the number of person-years lived during the interval." + - "Person-years refers to the combined total time that a group of people has lived. For example, if 10 people each live for 2 years, they collectively contribute 20 person-years." + - "The death rate is slightly different from the 'probability of death' during the interval, because the 'probability of death' metric uses a different denominator: the number of people alive at that age at the start of the interval, while this indicator uses the average number of people alive during the interval." + unit: deaths per 1,000 people + processing_level: minor + description_processing: |- + The original metric is given as a fraction between 0 and 1 (i.e. per-capita). We multiply this by 1,000 to get a per-1,000 people rate. + display: + name: |- + {tables.life_tables.variables.central_death_rate.title} {definitions.others.display_name_dim} + presentation: + title_public: |- + {tables.life_tables.variables.central_death_rate.title} {definitions.others.title_public_dim} + topic_tags: + - Life Expectancy + + probability_of_death: + title: Probability of death + unit: "%" + description_short: |- + The probability of dying in a given interval, among people who survived to the start of that interval. + description_key: + - "For example, the probability of death for a 50 year old in a given year is found by: dividing the number of deaths in 50 year olds that year, by the number of people alive at the age of 50 at the start of the year." + processing_level: minor + description_processing: |- + The original metric is given as a fraction between 0 and 1 (i.e. per-capita). We multiply this by 100 to get a percentage. + display: + name: |- + {tables.life_tables.variables.probability_of_death.title} {definitions.others.display_name_dim} + presentation: + title_public: |- + {tables.life_tables.variables.probability_of_death.title} {definitions.others.title_public_dim} + topic_tags: + - Life Expectancy + + average_survival_length: + title: Average survival length + short_unit: years + unit: years + description_short: Average length of survival between ages x and x+n for persons dying in the interval. + display: + name: |- + {tables.life_tables.variables.average_survival_length.title} {definitions.others.display_name_dim} + presentation: + title_public: |- + {tables.life_tables.variables.average_survival_length.title} {definitions.others.title_public_dim} + + number_survivors: + title: Number of survivors + unit: survivors + description_short: Number of survivors at a given age, assuming survivors at 0 years old is 100,000. 
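+        # Illustrative reading of the 100,000 radix mentioned in description_short (assumed example values):
+        description_key:
+          - For example, a value of 90,000 at age 65 would mean that 90% of a hypothetical cohort of 100,000 newborns survive to age 65, given that year's age-specific death rates.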
+ display: + name: |- + {tables.life_tables.variables.number_survivors.title} {definitions.others.display_name_dim} + presentation: + title_public: |- + {tables.life_tables.variables.number_survivors.title} {definitions.others.title_public_dim} + + number_deaths: + title: Number of deaths + short_unit: deaths + unit: deaths + description_short: Number of deaths between ages x and x+n. + display: + name: |- + {tables.life_tables.variables.number_deaths.title} {definitions.others.display_name_dim} + presentation: + title_public: |- + {tables.life_tables.variables.number_deaths.title} {definitions.others.title_public_dim} + topic_tags: + - Life Expectancy + + number_person_years_lived: + title: Number of person-years lived + unit: person-years + description_short: Number of person-years lived between ages x and x+n. + display: + name: |- + {tables.life_tables.variables.number_person_years_lived.title} {definitions.others.display_name_dim} + presentation: + title_public: |- + {tables.life_tables.variables.number_person_years_lived.title} {definitions.others.title_public_dim} + + number_person_years_remaining: + title: Number of person-years remaining + unit: person-years + description_short: Number of person-years remaining after a given age. + display: + name: |- + {tables.life_tables.variables.number_person_years_remaining.title} {definitions.others.display_name_dim} + presentation: + title_public: |- + {tables.life_tables.variables.number_person_years_remaining.title} {definitions.others.title_public_dim} + + life_expectancy: + title: Life expectancy + short_unit: years + unit: years + description_short: |- + <%- if age == '0' -%> + <%- if sex == 'total' -%> + The << type >> life expectancy at birth, in a given year. + <%- else -%> + The << type >> life expectancy at birth among << sex + 's' >>, in a given year. + <%- endif -%> + <%- else -%> + <%- if sex == 'total' -%> + The remaining << type >> life expectancy at age << age >>, in a given year. + <%- else -%> + The remaining << type >> life expectancy at age << age >> among << sex + 's' >>, in a given year. + <%- endif -%> + <%- endif -%> + description_key: + - |- + {definitions.global.life_expectancy.point_1} + - |- + {definitions.global.life_expectancy.point_2} + - |- + <%- if age != '0' -%> + <%- if type == "period" -%> + This shows the remaining period life expectancy among people who have already reached the age << age >>, using death rates from their age group and older age groups. + <%- else -%> + This shows the remaining cohort life expectancy of people who have reached the age << age >>. + <%- endif -%> + <%- endif -%> + display: + numDecimalPlaces: 1 + name: |- + {tables.life_tables.variables.life_expectancy.title} {definitions.others.display_name_dim} + presentation: + title_public: |- + {tables.life_tables.variables.life_expectancy.title} {definitions.others.title_public_dim} + + exposures: + common: + presentation: + title_variant: << sex + 's, ' if sex != 'total' >><< type + ' tables'>> + topic_tags: + - Life Expectancy + + variables: + exposure: + title: Exposure-to-risk + unit: person-years + description_short: The total number of person-years lived within a given interval. + description_key: + - It is equivalent to the average number of people living in that age group during the period. 
+ description_from_producer: |- + Estimates of the population exposed to the risk of death during some age-time interval are based on annual (January 1st) population estimates, with small corrections that reflect the timing of deaths during the interval. Period exposure estimations are based on assumptions of uniformity in the distribution of events except when historical monthly birth data are available. + display: + name: |- + {tables.exposures.variables.exposure.title} {definitions.others.display_name_dim} + presentation: + title_public: |- + {tables.exposures.variables.exposure.title} {definitions.others.title_public_dim} + + deaths: + common: + presentation: + topic_tags: + - Global Health + title_variant: << sex + 's, ' if sex != 'total' >> + + variables: + deaths: + title: Number of deaths + unit: deaths + description_short: |- + <% if sex == 'total' %> + The total number of deaths at age << age >> in a given year. + <%- else %> + The total number of << sex >> deaths at age << age >> in a given year. + <%- endif %> + display: + name: |- + {tables.deaths.variables.deaths.title} {definitions.others.display_name_dim} + presentation: + title_public: |- + {tables.deaths.variables.deaths.title} {definitions.others.title_public_dim} + + population: + common: + presentation: + topic_tags: + - Population Growth + title_variant: << sex + 's, ' if sex != 'total' >> + + variables: + population: + title: Population + unit: people + description_short: |- + <% if sex == 'total' %> + The total number of people aged << age >> living in a country. + <%- else %> + The total number of << sex + 's' >> aged << age >> living in a country. + <%- endif %> + description_processing: |- + From HMD Notes: For populations with territorial changes, two sets of population estimates are given for years in which a territorial change occurred. The first set of estimates (identified as year "19xx-") refers to the population just before the territorial change, whereas the second set (identified as year "19xx+") refers to the population just after the change. For example, in France, the data for "1914-" cover the previous territory (i.e., as of December 31, 1913), whereas the data for "1914+" reflect the territorial boundaries as of January 1, 1914. + + We have used the "19xx+" population estimates for the year of the territorial change. + display: + name: |- + {tables.population.variables.population.title} {definitions.others.display_name_dim} + presentation: + title_public: |- + {tables.population.variables.population.title} {definitions.others.title_public_dim} + + births: + common: + presentation: + topic_tags: + - Fertility Rate + title_variant: << sex + 's, ' if sex != 'total' >> + + variables: + births: + title: Births + unit: births + description_short: |- + <% if sex == 'total' %> + The total number of births in a given year. + <%- else %> + The total number of << sex >> births in a given year. + <%- endif %> + display: + name: |- + {tables.births.variables.births.title} {definitions.others.display_name_dim} + presentation: + title_public: |- + {tables.births.variables.births.title} {definitions.others.title_public_dim} + + birth_rate: + title: Birth rate + unit: births per 1,000 people + description_short: |- + <% if sex == 'total' %> + The total number of births per 1,000 people in a given year. + <%- else %> + The total number of << sex >> births per 1,000 in a given year. 
+ <%- endif %> + display: + name: |- + {tables.births.variables.births.title} {definitions.others.display_name_dim} + presentation: + title_public: |- + {tables.births.variables.births.title} {definitions.others.title_public_dim} + + diff_ratios: + common: + presentation: + topic_tags: + - Life Expectancy + + variables: + central_death_rate_mf_ratio: + title: Central death rate ratio (m/f) + unit: "" + description_short: |- + The ratio of the << type >> central death rate (males to females) at age << age >>. + processing_level: major + display: + name: |- + Central death rate (male-to-female ratio) at << 'birth' if (age == '0') else age >>, << type >> + presentation: + title_public: Central death rate {definitions.others.title_public_dim} + title_variant: |- + male-to-female ratio, << type >> tables + topic_tags: + - Life Expectancy + - Gender Ratio + + life_expectancy_fm_diff: + title: Life expectancy difference (f-m) + short_unit: years + unit: years + description_short: |- + The difference in the << type >> life expectancy (females - males) at age << age >>. + processing_level: major + description_key: + - Higher values indicate longer life expectancy among females than males. + - |- + {definitions.global.life_expectancy.point_1} + - |- + {definitions.global.life_expectancy.point_2} + display: + numDecimalPlaces: 1 + name: |- + Life expectancy (female-male difference) at << 'birth' if (age == '0') else age >>, << type >> + presentation: + title_public: Life expectancy at << age if age != '0' else 'birth'>> + title_variant: female-male difference, << type >> tables + topic_tags: + - Life Expectancy + - Gender Ratio + + life_expectancy_mf_ratio: + title: Life expectancy ratio (f/m) + unit: "" + short_unit: "" + description_short: |- + The ratio of the << type >> life expectancy (males to females) at age << age >>. + processing_level: major + description_key: + - Higher values indicate longer life expectancy among females than males. + - |- + {definitions.global.life_expectancy.point_1} + - |- + {definitions.global.life_expectancy.point_2} + display: + numDecimalPlaces: 1 + name: |- + Life expectancy (female-to-male ratio) at << 'birth' if (age == '0') else age >>, << type >> + presentation: + title_public: Life expectancy at << age if age != '0' else 'birth'>> + title_variant: female-to-male ratio, << type >> tables + topic_tags: + - Life Expectancy + - Gender Ratio diff --git a/etl/steps/data/garden/hmd/2024-11-27/hmd.py b/etl/steps/data/garden/hmd/2024-11-27/hmd.py new file mode 100644 index 00000000000..d1fde80301b --- /dev/null +++ b/etl/steps/data/garden/hmd/2024-11-27/hmd.py @@ -0,0 +1,228 @@ +"""Load a meadow dataset and create a garden dataset.""" + +import numpy as np +from owid.catalog import Table + +from etl.data_helpers import geo +from etl.helpers import PathFinder, create_dataset + +# Get paths and naming conventions for current step. +paths = PathFinder(__file__) + + +def run(dest_dir: str) -> None: + # + # Load inputs. + # + # Load meadow dataset. + ds_meadow = paths.load_dataset("hmd") + + # Read table from meadow dataset. + paths.log.info("reading tables") + tb_lt = ds_meadow.read("life_tables") + tb_exp = ds_meadow.read("exposures") + tb_mort = ds_meadow.read("deaths") + tb_pop = ds_meadow.read("population") + tb_births = ds_meadow.read("births") + + # Drop NaNs + tb_exp = tb_exp.dropna(subset="exposure") + tb_births = tb_births.dropna(subset="births") + + # + # Process data. 
+ # + paths.log.info("processing tables") + + # 1/ Life tables + def _sanity_check_lt(tb): + summary = tb.groupby(["country", "year", "sex", "type", "age"], as_index=False).size().sort_values("size") + row_dups = summary.loc[summary["size"] != 1] + assert row_dups.shape[0] <= 19, "Found duplicated rows in life tables!" + assert (row_dups["country"].unique() == "Switzerland").all() & ( + row_dups["year"] <= 1931 + ).all(), "Unexpected duplicates in life tables!" + + flag = ( + (tb_lt["country"] == "Switzerland") + & (tb_lt["age"] == "110+") + & (tb_lt["type"] == "cohort") + & (tb_lt["sex"] == "male") + & (tb_lt["year"] <= 1931) + & (tb_lt["year"] >= 1913) + ) + tb = tb.loc[~flag] + + return tb + + tb_lt = process_table( + tb=tb_lt, + col_index=["country", "year", "sex", "age", "type"], + sex_expected={"females", "males", "total"}, + callback_post=_sanity_check_lt, + ) + # Scale central death rates + tb_lt["central_death_rate"] = tb_lt["central_death_rate"] * 1_000 + tb_lt["probability_of_death"] = tb_lt["probability_of_death"] * 100 + + # 2/ Exposures + tb_exp = process_table( + tb=tb_exp, + col_index=["country", "year", "sex", "age", "type"], + ) + + # 3/ Mortality + tb_mort = process_table( + tb=tb_mort, + col_index=["country", "year", "sex", "age", "type"], + ) + assert set(tb_mort["type"].unique()) == {"period"}, "Unexpected values in column 'type' in mortality tables!" + tb_mort = tb_mort.drop(columns="type") + + # 4/ Population + tb_pop = process_table( + tb=tb_pop, + col_index=["country", "year", "sex", "age"], + ) + + # 5/ Births + tb_births = process_table( + tb=tb_births, + col_index=["country", "year", "sex"], + ) + + def add_birth_rate(tb_pop, tb_births): + tb_pop_agg = tb_pop.groupby(["country", "year", "sex"], as_index=False)["population"].sum() + tb_births = tb_births.merge(tb_pop_agg, on=["country", "year", "sex"], how="left") + tb_births["birth_rate"] = tb_births["births"] / tb_births["population"] * 1_000 + tb_births["birth_rate"] = tb_births["birth_rate"].replace([np.inf, -np.inf], np.nan) + tb_births = tb_births.drop(columns=["population"]) + return tb_births + + tb_births = add_birth_rate(tb_pop, tb_births) + + # 6/ Create table with differences and ratios + tb_ratios = make_table_diffs_ratios(tb_lt) + + # Create list with tables + paths.log.info("saving tables") + tables = [ + tb_lt.format(["country", "year", "sex", "age", "type"]), + tb_exp.format(["country", "year", "sex", "age", "type"]), + tb_mort.format(["country", "year", "sex", "age"]), + tb_pop.format(["country", "year", "sex", "age"]), + tb_births.format(["country", "year", "sex"]), + tb_ratios.format(["country", "year", "age", "type"], short_name="diff_ratios"), + ] + + # + # Save outputs. + # + # Create a new garden dataset with the same metadata as the meadow dataset. + ds_garden = create_dataset( + dest_dir, + tables=tables, + check_variables_metadata=True, + ) + + # Save changes in the new garden dataset. + ds_garden.save() + + +def process_table(tb, col_index, sex_expected=None, callback_post=None): + """Reshape a table. + + Input table has column `format`, which is sort-of redundant. This function ensures we can safely drop it (i.e. no duplicate rows). + + Additionally, it standardizes the dimension values. 
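+
+    Illustrative call, mirroring how run() above already uses this helper (the
+    column names are the ones used elsewhere in this module):
+
+        tb_exp = process_table(
+            tb=tb_exp,
+            col_index=["country", "year", "sex", "age", "type"],
+        )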
+ """ + paths.log.info(f"processing table {tb.m.short_name}") + + if sex_expected is None: + sex_expected = {"female", "male", "total"} + + # Standardize dimension values + tb = standardize_sex_cat_names(tb, sex_expected) + + # Drop duplicate rows + tb = tb.sort_values("format").drop_duplicates(subset=[col for col in tb.columns if col != "format"], keep="first") + + # Check no duplicates + if callback_post is not None: + tb = callback_post(tb) + else: + summary = tb.groupby(col_index, as_index=False).size().sort_values("size") + row_dups = summary.loc[summary["size"] != 1] + assert row_dups.empty, "Found duplicated rows in life tables!" + + # Final dropping o f columns + tb = tb.drop(columns="format") + + # Country name standardization + tb = geo.harmonize_countries( + df=tb, + countries_file=paths.country_mapping_path, + ) + + # Make year column integer + tb["year"] = tb["year"].astype(int) + + return tb + + +def standardize_sex_cat_names(tb, sex_expected): + # Define expected sex categories + sex_expected = {s.lower() for s in sex_expected} + + # Set sex categories to lowercase + tb["sex"] = tb["sex"].str.lower() + + # Sanity check categories + sex_found = set(tb["sex"].unique()) + assert sex_found == sex_expected, f"Unexpected sex categories! Found {sex_found} but expected {sex_expected}" + + # Rename + tb["sex"] = tb["sex"].replace({"females": "female", "males": "male"}) + + return tb + + +def make_table_diffs_ratios(tb: Table) -> Table: + """Create table with metric differences and ratios. + + Currently, we estimate: + + - female - male: Life expectancy + - male/female: Life Expectancy, Central Death Rate + """ + # Pivot & obtain differences and ratios + cols_index = ["country", "year", "age", "type"] + tb_new = ( + tb.pivot_table( + index=cols_index, + columns="sex", + values=["life_expectancy", "central_death_rate"], + ) + .assign( + life_expectancy_fm_diff=lambda df: df[("life_expectancy", "female")] - df[("life_expectancy", "male")], + life_expectancy_mf_ratio=lambda df: df[("life_expectancy", "male")] / df[("life_expectancy", "female")], + central_death_rate_mf_ratio=lambda df: df[("central_death_rate", "male")] + / df[("central_death_rate", "female")], + ) + .reset_index() + ) + + # Keep relevant columns + cols = [col for col in tb_new.columns if col[1] == ""] + tb_new = tb_new.loc[:, cols] + + # Rename columns + tb_new.columns = [col[0] for col in tb_new.columns] + + # Add metadata back + for col in tb_new.columns: + if col not in cols_index: + tb_new[col].metadata.origins = tb["life_expectancy"].m.origins.copy() + tb_new[col] = tb_new[col].replace([np.inf, -np.inf], np.nan) + + return tb_new diff --git a/etl/steps/data/grapher/demography/2024-12-02/survivor_percentiles.py b/etl/steps/data/grapher/demography/2024-12-02/survivor_percentiles.py new file mode 100644 index 00000000000..1e319eaee4c --- /dev/null +++ b/etl/steps/data/grapher/demography/2024-12-02/survivor_percentiles.py @@ -0,0 +1,35 @@ +"""Load a garden dataset and create a grapher dataset.""" + +from etl.helpers import PathFinder, create_dataset + +# Get paths and naming conventions for current step. +paths = PathFinder(__file__) + + +def run(dest_dir: str) -> None: + # + # Load inputs. + # + # Load garden dataset. + ds_garden = paths.load_dataset("survivor_percentiles") + + # Read table from garden dataset. + + # + # Process data. + # + tables = list(ds_garden) + + # + # Save outputs. + # + # Create a new grapher dataset with the same metadata as the garden dataset. 
+ ds_grapher = create_dataset( + dest_dir, + tables=tables, + check_variables_metadata=True, + default_metadata=ds_garden.metadata, + ) + + # Save changes in the new grapher dataset. + ds_grapher.save() diff --git a/etl/steps/data/grapher/hmd/2024-11-27/hmd.py b/etl/steps/data/grapher/hmd/2024-11-27/hmd.py new file mode 100644 index 00000000000..88cb5a79bd8 --- /dev/null +++ b/etl/steps/data/grapher/hmd/2024-11-27/hmd.py @@ -0,0 +1,82 @@ +"""Load a garden dataset and create a grapher dataset.""" + +from etl.helpers import PathFinder, create_dataset + +# Get paths and naming conventions for current step. +paths = PathFinder(__file__) + +INDICATORS_RELEVANT_LT = [ + "central_death_rate", + "life_expectancy", + "probability_of_death", +] + + +def run(dest_dir: str) -> None: + # + # Load inputs. + # + # Load garden dataset. + ds_garden = paths.load_dataset("hmd") + + # Read table from garden dataset. + tb_lt = ds_garden.read("life_tables") + tb_exposure = ds_garden.read("exposures") + tb_deaths = ds_garden.read("deaths") + tb_pop = ds_garden.read("population") + tb_births = ds_garden.read("births") + tb_ratios = ds_garden.read("diff_ratios") + + # Filter relevant dimensions + tb_lt = keep_only_relevant_dimensions(tb_lt) + tb_exposure = keep_only_relevant_dimensions(tb_exposure) + tb_deaths = keep_only_relevant_dimensions(tb_deaths) + tb_pop = keep_only_relevant_dimensions(tb_pop) + tb_ratios = keep_only_relevant_dimensions(tb_ratios) + + # + # Save outputs. + # + cols_index = ["country", "year", "sex", "age", "type"] + tables = [ + tb_lt.format(cols_index), + tb_exposure.format(cols_index), + tb_deaths.format(["country", "year", "sex", "age"]), + tb_pop.format(["country", "year", "sex", "age"]), + tb_births.format(["country", "year", "sex"]), + tb_ratios.format(["country", "year", "age", "type"]), + ] + # Create a new grapher dataset with the same metadata as the garden dataset. + ds_grapher = create_dataset( + dest_dir, tables=tables, check_variables_metadata=True, default_metadata=ds_garden.metadata + ) + + # Save changes in the new grapher dataset. + ds_grapher.save() + + +def keep_only_relevant_dimensions(tb): + """Keep only relevant dimensions. + + - We only preserve 5-year age groups, and specific 1-year age groups. + - We only preserve 1-year observation periods. + + """ + AGES_SINGLE = [ + 0, + 10, + 15, + 25, + 45, + 65, + 80, + ] + AGES_SINGLE = list(map(str, AGES_SINGLE)) + ["110+"] + flag_1 = tb["age"].isin(AGES_SINGLE) + flag_2 = tb["age"].str.contains( + "-", + ) + + tb = tb.loc[flag_1 | flag_2] + + return tb diff --git a/etl/steps/data/meadow/hmd/2024-11-27/hmd.py b/etl/steps/data/meadow/hmd/2024-11-27/hmd.py new file mode 100644 index 00000000000..d4a56860adf --- /dev/null +++ b/etl/steps/data/meadow/hmd/2024-11-27/hmd.py @@ -0,0 +1,325 @@ +"""Load a snapshot and create a meadow dataset.""" + +import re +from io import StringIO +from pathlib import Path +from typing import List + +import owid.catalog.processing as pr +from owid.catalog import Table + +from etl.helpers import PathFinder, create_dataset + +# Get paths and naming conventions for current step. 
+paths = PathFinder(__file__) + + +# Life tables +FOLDERS_LT = [ + "lt_male", + "lt_female", + "lt_both", + "c_lt_male", + "c_lt_female", + "c_lt_both", +] +REGEX_LT = ( + r"(?P[a-zA-Z\-\s,]+), Life tables \((?P[a-zA-Z]+) (?P\d+x\d+)\), (?P[a-zA-Z]+)" + r"\tLast modified: (?P\d+ [a-zA-Z]{3} \d+); Methods Protocol: v\d+ \(\d+\)\n\n(?P(?s:.)*)" +) +COLUMNS_RENAME_LT = { + "mx": "central_death_rate", + "qx": "probability_of_death", + "ax": "average_survival_length", + "lx": "number_survivors", + "dx": "number_deaths", + "Lx": "number_person_years_lived", + "Tx": "number_person_years_remaining", + "ex": "life_expectancy", +} + +# Exposures +FOLDERS_EXPOSURES = [ + "c_exposures", + "exposures", +] +REGEX_EXP = ( + r"(?P[a-zA-Z\-\s,]+), (?PExposure) to risk \((?P[a-zA-Z]+) (?P\d+x\d+)\),\s\tLast modified: " + r"(?P\d+ [a-zA-Z]{3} \d+); Methods Protocol: v\d+ \(\d+\)\n\n(?P(?s:.)*)" +) + +# Mortality +FOLDERS_MOR = [ + "deaths", +] +REGEX_MOR = ( + r"(?P[a-zA-Z\-\s,]+), (?PDeaths) \((?P[a-zA-Z]+) (?P\d+x\d+|Lexis triangle)\),\s\tLast modified: " + r"(?P\d+ [a-zA-Z]{3} \d+); Methods Protocol: v\d+ \(\d+\)\n\n(?P(?s:.)*)" +) +# Population +FOLDERS_POP = [ + "population", +] +REGEX_POP = ( + r"(?P[a-zA-Z\-\s,]+?),?\s?(?PPopulation) size \((?P1\-year|abridged)\)\s+Last modified: " + r"(?P\d+ [a-zA-Z]{3} \d+)(; Methods Protocol: v\d+ \(\d+\)|,MPv\d \(in development\))\n\n(?P(?s:.)*)" +) +# Births +FOLDERS_BIRTHS = [ + "births", +] +REGEX_BIRTHS = ( + r"(?P[a-zA-Z\-\s,]+),\s+(?PBirths) \((?P1\-year)\)\s+Last modified: " + r"(?P\d+ [a-zA-Z]{3} \d+); Methods Protocol: v\d+ \(\d+\)\n\n(?P(?s:.)*)" +) + + +def run(dest_dir: str) -> None: + # + # Load inputs. + # + # Retrieve snapshot. + snap = paths.load_snapshot("hmd.zip") + + # Load data from snapshot. + with snap.extract_to_tempdir() as tmpdir: + # Population + tb_pop = make_tb( + path=Path(tmpdir), + main_folders=FOLDERS_POP, + regex=REGEX_POP, + snap=snap, + ) + + # Life tables + tb_lt = make_tb( + path=Path(tmpdir), + main_folders=FOLDERS_LT, + regex=REGEX_LT, + snap=snap, + ) + # Exposure + tb_exp = make_tb( + path=Path(tmpdir), + main_folders=FOLDERS_EXPOSURES, + regex=REGEX_EXP, + snap=snap, + ) + # Mortality + tb_m = make_tb( + path=Path(tmpdir), + main_folders=FOLDERS_MOR, + regex=REGEX_MOR, + snap=snap, + ) + + # Births + tb_bi = make_tb( + path=Path(tmpdir), + main_folders=FOLDERS_BIRTHS, + regex=REGEX_BIRTHS, + snap=snap, + ) + + # Life tables + ## Column rename + ## e.g. "Lx -> lx" and "lx -> lx". This will cause an error when setting the index. + tb_lt = tb_lt.rename(columns=COLUMNS_RENAME_LT) + + # Population + ## Invert 'abridged' <-> '1-year' in the type column + message = "Types 'abridged' and '1-year' might not be reversed anymore!" + assert not tb_pop.loc[tb_pop["format"] == "abridged", "Age"].str.contains("-").any(), message + assert tb_pop.loc[tb_pop["format"] == "1-year", "Age"].str.contains("80-84").any(), message + tb_pop["format"] = tb_pop["format"].map( + lambda x: "1-year" if x == "abridged" else "abridged" if x == "1-year" else x + ) + + # Check missing values + _check_nas(tb_lt, 0.01, 14) + _check_nas(tb_exp, 0.23, 47) + _check_nas(tb_m, 0.001, 1) + _check_nas(tb_pop, 0.001, 1) + + # Ensure correct year dtype + tb_lt = _clean_year(tb_lt) + tb_exp = _clean_year(tb_exp) + tb_m = _clean_year(tb_m) + tb_bi = _clean_year(tb_bi) + tb_pop = _clean_population_type(tb_pop) + + # Ensure all columns are snake-case, set an appropriate index, and sort conveniently. 
+ tables = [ + tb_lt.format(["country", "year", "sex", "age", "type", "format"], short_name="life_tables"), + tb_exp.format(["country", "year", "sex", "age", "type", "format"], short_name="exposures"), + tb_m.format(["country", "year", "sex", "age", "type", "format"], short_name="deaths"), + tb_pop.format(["country", "year", "sex", "age", "format"], short_name="population"), + tb_bi.format(["country", "year", "sex", "format"], short_name="births"), + ] + + # + # Save outputs. + # + # Create a new meadow dataset with the same metadata as the snapshot. + ds_meadow = create_dataset( + dest_dir, + tables=tables, + check_variables_metadata=True, + default_metadata=snap.metadata, + ) + + # Save changes in the new meadow dataset. + ds_meadow.save() + + +def make_tb(path: Path, main_folders: List[str], regex: str, snap) -> Table: + """Create table from multiple category folders. + + It inspects the content in `main_folders` (should be in `path`), and looks for TXT files to parse into tables. + + The output is a table with the relevant indicators and dimensions for all the categories. + + Arguments: + path: Path where the HMD export is located. + main_folders: List of folders to consider in `path`. These should typically be categories, which + group different individual indicators + regex: Regex to extract the metadata for a set of TXTs file found in main_folders. We need this + because the structure of the header in the TXT files slightly varies depending on + the indicator. + """ + # List with all relevant tables + tbs = [] + # Iterate over each top-level folder + for category_folder in main_folders: + main_folder_path = path / category_folder + if not main_folder_path.is_dir(): + raise FileNotFoundError(f"Folder {main_folder_path} not found in {path}") + # Iterate over each indicator folder + for indicator_path in main_folder_path.iterdir(): + if "lexis" in indicator_path.name: + continue + if indicator_path.is_dir(): + # Read all TXT files in the indicator folder, and put them as a single table + paths.log.info(f"Creating list of tables from available files in {path}...") + files = list(indicator_path.glob("*.txt")) + tbs_ = [make_tb_from_txt(f, regex, snap) for f in files] + tbs.extend(tbs_) + # Concatenate all dataframes + tb = pr.concat(tbs, ignore_index=True) + return tb + + +def make_tb_from_txt(text_path: Path, regex: str, snap) -> Table: + """Create a table from a TXT file.""" + # print(text_path) + # Extract fields + groups = extract_fields(regex, text_path) + + # Build df + tb = parse_table(groups["data"], snap) + + # Optional melt + if ("Female" in tb.columns) and ("Male" in tb.columns): + id_vars = [col for col in ["Age", "Year"] if col in tb.columns] + if "name" not in groups: + raise ValueError( + f"Indicator name not found in {text_path}! Please revise that source files' content matches FILE_REGEX." + ) + tb = tb.melt(id_vars=id_vars, var_name="sex", value_name=groups["name"]) + + # Add dimensions + tb = tb.assign( + country=groups["country"], + ) + + # Optional sex column + if "sex" in groups: + tb["sex"] = groups["sex"] + if "format" in groups: + tb["format"] = groups["format"] + if "type" in groups: + tb["type"] = groups["type"] + return tb + + +def extract_fields(regex: str, path: Path) -> dict: + """Structure the fields in the raw TXT file.""" + # Read single file + with open(path, "r") as f: + text = f.read() + # Get relevant fields + match = re.search(regex, text) + if match is not None: + groups = match.groupdict() + else: + raise ValueError(f"No match found in {f}! 
Please revise that source files' content matches FILE_REGEX.") + return groups + + +def parse_table(data_raw: str, snap): + """Given the raw data from the TXT file (as string) map it to a table.""" + tb_str = data_raw.strip() + tb_str = re.sub(r"\n\s+", "\n", tb_str) + tb_str = re.sub(r"[^\S\r\n]+", "\t", string=tb_str) + tb = pr.read_csv( + StringIO(tb_str), + sep="\t", + na_values=["."], + metadata=snap.to_table_metadata(), + origin=snap.m.origin, + ) + + return tb + + +def _check_nas(tb, missing_row_max, missing_countries_max): + """Check missing values & countries in data.""" + row_nans = tb.isna().any(axis=1) + assert ( + row_nans.sum() / len(tb) < missing_row_max + ), f"Too many missing values in life tables: {row_nans.sum()/len(tb)}" + + # Countries missing + countries_missing_data = tb.loc[row_nans, "country"].unique() + assert ( + len(countries_missing_data) / len(tb) < missing_countries_max + ), f"Too many missing values in life tables: {len(countries_missing_data)}" + + +def _clean_population_type(tb): + """Data provider notes the following: + + For populations with territorial changes, two sets of population estimates are given for years in which a territorial change occurred. The first set of estimates (identified as year "19xx-") refers to the population just before the territorial change, whereas the second set (identified as year "19xx+") refers to the population just after the change. For example, in France, the data for "1914-" cover the previous territory (i.e., as of December 31, 1913), whereas the data for "1914+" reflect the territorial boundaries as of January 1, 1914. + + To avoid confusion and duplicity, whenever there are multiple entries for a year, we keep YYYY+ definition for the year (e.g. country with new territorial changes). + """ + # Crete new column with the year. + regex = r"\b\d{4}\b" + tb["year"] = tb["Year"].astype("string").str.extract(f"({regex})", expand=False) + assert tb["year"].notna().all(), "Year extraction was successful!" + tb["year"] = tb["year"].astype(int) + + # Ensure raw year is as expected + assert ( + tb.groupby(["country", "year", "Age", "sex", "format"]).Year.nunique().max() == 2 + ), "Unexpected number of years (+/-)" + + # Drop duplicate years, keeping YYYY+. + tb["Year"] = tb["Year"].astype("string") + tb = tb.sort_values("Year") + tb = tb.drop_duplicates(subset=["year", "Age", "sex", "country", "format"], keep="first").drop(columns="Year") + + tb = tb.rename(columns={"year": "Year"}) + + # Additionally, remove year periods + tb = _clean_year(tb) + + return tb + + +def _clean_year(tb): + # Remove year ranges, and convert to int + flag = tb["Year"].astype("string").str.contains("-") + tb = tb.loc[~flag] + tb["Year"] = tb["Year"].astype("int") + return tb diff --git a/snapshots/hmd/2024-11-27/hmd.py b/snapshots/hmd/2024-11-27/hmd.py new file mode 100644 index 00000000000..c5180a4dfa6 --- /dev/null +++ b/snapshots/hmd/2024-11-27/hmd.py @@ -0,0 +1,25 @@ +"""Script to create a snapshot of dataset.""" + +from pathlib import Path + +import click + +from etl.snapshot import Snapshot + +# Version for current snapshot dataset. +SNAPSHOT_VERSION = Path(__file__).parent.name + + +@click.command() +@click.option("--upload/--skip-upload", default=True, type=bool, help="Upload dataset to Snapshot") +@click.option("--path-to-file", "-f", prompt=True, type=str, help="Path to local data file.") +def main(path_to_file: str, upload: bool) -> None: + # Create a new snapshot. 
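+    # One way to run this snapshot step, using the click options defined above
+    # (the local path below is only a placeholder):
+    #
+    #   python snapshots/hmd/2024-11-27/hmd.py --path-to-file /path/to/hmd.zip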
+ snap = Snapshot(f"hmd/{SNAPSHOT_VERSION}/hmd.zip") + + # Copy local data file to snapshots data folder, add file to DVC and upload to S3. + snap.create_snapshot(filename=path_to_file, upload=upload) + + +if __name__ == "__main__": + main() diff --git a/snapshots/hmd/2024-11-27/hmd.zip.dvc b/snapshots/hmd/2024-11-27/hmd.zip.dvc new file mode 100644 index 00000000000..63f4dfe3a51 --- /dev/null +++ b/snapshots/hmd/2024-11-27/hmd.zip.dvc @@ -0,0 +1,74 @@ +# Learn more at: +# http://docs.owid.io/projects/etl/architecture/metadata/reference/ +meta: + origin: + # Data product / Snapshot + title: Human Mortality Database + description: |- + The Human Mortality Database (HMD) contains original calculations of all-cause death rates and life tables for national populations (countries or areas), as well as the input data used in constructing those tables. The input data consist of death counts from vital statistics, plus census counts, birth counts, and population estimates from various sources. + + + # Scope and basic principles + + The database is limited by design to populations where death registration and census data are virtually complete, since this type of information is required for the uniform method used to reconstruct historical data series. As a result, the countries and areas included here are relatively wealthy and for the most part highly industrialized. + + The main goal of the Human Mortality Database is to document the longevity revolution of the modern era and to facilitate research into its causes and consequences. As much as possible, the authors of the database have followed four guiding principles: comparability, flexibility, accessibility, reproducibility. + + + # Computing death rates and life tables + + Their process for computing mortality rates and life tables can be described in terms of six steps, corresponding to six data types that are available from the HMD. Here is an overview of the process: + + 1. Births. Annual counts of live births by sex are collected for each population over the longest possible time period. These counts are used mainly for making population estimates at younger ages. + 2. Deaths. Death counts are collected at the finest level of detail available. If raw data are aggregated, uniform methods are used to estimate death counts by completed age (i.e., age-last-birthday at time of death), calendar year of death, and calendar year of birth. + 3. Population size. Annual estimates of population size on January 1st are either obtained from another source or are derived from census data plus birth and death counts. + 4. Exposure-to-risk. Estimates of the population exposed to the risk of death during some age-time interval are based on annual (January 1st) population estimates, with a small correction that reflects the timing of deaths within the interval. + 5. Death rates. Death rates are always a ratio of the death count for a given age-time interval divided by an estimate of the exposure-to-risk in the same interval. + 6. Life tables. To build a life table, probabilities of death are computed from death rates. These probabilities are used to construct life tables, which include life expectancies and other useful indicators of mortality and longevity. + + + # Corrections to the data + + The data presented here have been corrected for gross errors (e.g., a processing error whereby 3,800 becomes 38,000 in a published statistical table would be obvious in most cases, and it would be corrected). 
However, the authors have not attempted to correct the data for systematic age misstatement (misreporting of age) or coverage errors (over- or under-enumeration of people or events). + + Some available studies assess the completeness of census coverage or death registration in the various countries, and more work is needed in this area. However, in developing the database thus far, the authors did not consider it feasible or desirable to attempt corrections of this sort, especially since it would be impossible to correct the data by a uniform method across all countries. + + + # Age misreporting + + Populations are included here if there is a well-founded belief that the coverage of their census and vital registration systems is relatively high, and thus, that fruitful analyses by both specialists and non-specialists should be possible with these data. Nevertheless, there is evidence of both age heaping (overreporting ages ending in "0" or "5") and age exaggeration in these data. + + In general, the degree of age heaping in these data varies by the time period and population considered, but it is usually no burden to scientific analysis. In most cases, it is sufficient to analyze data in five-year age groups in order to avoid the false impressions created by this particular form of age misstatement. + + Age exaggeration, on the other hand, is a more insidious problem. The authors' approach is guided by the conventional wisdom that age reporting in death registration systems is typically more reliable than in census counts or official population estimates. For this reason, the authors derive population estimates at older ages from the death counts themselves, employing extinct cohort methods. Such methods eliminate some, but certainly not all, of the biases in old-age mortality estimates due to age exaggeration. + + + # Uniform set of procedures + + A key goal of this project is to follow a uniform set of procedures for each population. This approach does not guarantee the cross-national comparability of the data. Rather, it ensures only that the authors have not introduced biases by the authors' own manipulations. The desire of the authors for uniformity had to face the challenge that raw data come in a variety of formats (for example, 1-year versus 5-year age groups). The authors' general approach to this problem is that the available raw data are used first to estimate two quantities: 1) the number of deaths by completed age, year of birth, and year of death; and 2) population estimates by single years of age on January 1 of each year. For each population, these calculations are performed separately by sex. From these two pieces of information, they compute death rates and life tables in a variety of age-time configurations. + + It is reasonable to ask whether a single procedure is the best method for treating the data from a variety of populations. Here, two points must be considered. First, the authors' uniform methodology is based on procedures that were developed separately, though following similar principles, for various countries and by different researchers. Earlier methods were synthesized by choosing what they considered the best among alternative procedures and by eliminating superficial inconsistencies. The second point is that a uniform procedure is possible only because the authors have not attempted to correct the data for reporting and coverage errors. 
Although some general principles could be followed, such problems would have to be addressed individually for each population. + + Although the authors adhere strictly to a uniform procedure, the data for each population also receive significant individualized attention. Each country or area is assigned to an individual researcher, who takes responsibility for assembling and checking the data for errors. In addition, the person assigned to each country/area checks the authors' data against other available sources. These procedures help to assure a high level of data quality, but assistance from database users in identifying problems is always appreciated! + date_published: "2024-11-13" + # Citation + producer: Human Mortality Database + citation_full: |- + HMD. Human Mortality Database. Max Planck Institute for Demographic Research (Germany), University of California, Berkeley (USA), and French Institute for Demographic Studies (France). Available at www.mortality.org. + + See also the methods protocol: + Wilmoth, J. R., Andreev, K., Jdanov, D., Glei, D. A., Riffe, T., Boe, C., Bubenheim, M., Philipov, D., Shkolnikov, V., Vachon, P., Winant, C., & Barbieri, M. (2021). Methods protocol for the human mortality database (v6). [Available online](https://www.mortality.org/File/GetDocument/Public/Docs/MethodsProtocolV6.pdf) (needs log in to mortality.org). + attribution_short: HMD + # Files + url_main: https://www.mortality.org/Data/ZippedDataFiles + date_accessed: 2024-11-27 + + # License + license: + name: CC BY 4.0 + url: https://www.mortality.org/Data/UserAgreement + +outs: + - md5: ceed045241a19573e6621423b582558e + size: 147314590 + path: hmd.zip
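
As a rough illustration of how the indicators in this update relate to each other (using made-up numbers, not HMD data): the central death rate m(x) is deaths divided by exposure-to-risk (step 5 in the producer's description above), the cumulative hazard H(x) is the integral of m(x) over age, the survival curve is S(x) = exp(-H(x)), and a survivorship-percentile age (as in the survivor_percentiles garden step) is the age at which S(x) falls to that percentile. The garden step additionally fits cubic splines to deaths and exposures before integrating; this sketch skips that refinement.

import numpy as np
from scipy.integrate import cumulative_trapezoid

ages = np.arange(0, 111)                                   # single-year ages 0..110
deaths = np.full(ages.shape, 50.0)                         # hypothetical death counts
deaths[65:] = np.linspace(200, 5_000, len(ages) - 65)      # mortality rises at older ages
exposure = np.linspace(100_000, 1_000, len(ages))          # hypothetical person-years lived

mx = deaths / exposure                                     # central death rate (per person-year)
Hx = cumulative_trapezoid(mx, ages, initial=0)             # cumulative hazard H(x)
Sx = np.exp(-Hx)                                           # survival curve, S(0) = 1

# Age at which only 10% of the synthetic cohort is still alive (an "s10"-style value
# in the survivor_percentiles step, before its final 100-minus-percentile relabelling).
s10 = ages[Sx <= 0.10][0] if (Sx <= 0.10).any() else np.nan

print(f"m(0) per 1,000 people: {mx[0] * 1_000:.2f}")       # the garden step scales mx by 1,000
print(f"Age where survival drops to 10%: {s10}")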