📜 private datasets #3561

Merged: 10 commits, Nov 19, 2024
246 changes: 25 additions & 221 deletions docs/architecture/workflow/index.md

Large diffs are not rendered by default.

31 changes: 26 additions & 5 deletions docs/architecture/workflow/other-steps.md
@@ -1,11 +1,35 @@
---
status: new
---

So far you have learned about the standard steps, which should cover most cases. However, some other steps are worth mentioning.

## Export steps

Sometimes we want to perform an action instead of creating a dataset. For instance, we might want to create a TSV file for an explorer, commit a CSV to a GitHub repository, or create a config for a multi-dimensional indicator. This is where the `Export` step comes in.

Export steps are used to perform an action on an already created dataset. This action typically implies making the data available to other parts of the system. There are different types of export steps:

- **Explorers**: Create a TSV file for a data explorer.
- **Multi-dimensional indicators**: Create a configuration for a multi-dimensional indicator.
- **Export to GitHub**: Commit a dataset to a GitHub repository.

Export steps should be used after the data has been processed and is ready to be used (post-Garden).
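Schematically, export steps sit downstream of Garden (a sketch; node names are illustrative):

```mermaid
flowchart LR
    garden[Garden dataset] --> export[Export step]
    export --> explorer[Explorer TSV]
    export --> mdim[Multi-dim config]
    export --> github[GitHub repository]
```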

!!! note "Learn more about [export steps](../../guides/data-work/export-data.md)"

### Explorers

Data explorers are Grapher charts expanded with additional functionalities to facilitate exploration, such as dynamic entity filters or customizable menus. They are usually powered by indicators from OWID's Grapher database.

!!! info "Learn more about creating Data explorers [on Notion :octicons-arrow-right-24:](https://www.notion.so/owid/Creating-Data-Explorers-cf47a5ef90f14c1fba8fc243aba79be7)."

!!! note "Legacy explorers"

    In the past, explorers were manually defined in our Admin, and their data was sourced from CSV files generated by ETL [served from S3](https://dash.cloudflare.com/078fcdfed9955087315dd86792e71a7e/r2/default/buckets/owid-catalog), or on GitHub.

    We have slowly transitioned to a new system where explorers are generated from the ETL pipeline, which is more scalable and maintainable.

## Backport

Datasets from our production grapher database can be backported to the ETL catalog.
@@ -42,9 +66,6 @@ flowchart LR
classDef node_ss fill:#002147,color:#fff
```

## Open Numbers

!!! warning "TO BE DONE"

## ETag

1 change: 0 additions & 1 deletion docs/guides/auto-regular-updates.md
@@ -1,7 +1,6 @@
---
tags:
  - 👷 Staff
---

!!! warning "This is a work in progress"
157 changes: 157 additions & 0 deletions docs/guides/data-work/export-data.md
@@ -0,0 +1,157 @@
---
status: new
---

!!! warning "Export steps are a work in progress"

Export steps are defined in the `etl/steps/export` directory and have a similar structure to regular steps. They are run with the `--export` flag:

```bash
etlr export://explorers/minerals/latest/minerals --export
```

The `def run(dest_dir):` function doesn't save a dataset; instead, it calls a method that performs the action, for instance `create_explorer(...)` or `gh.commit_file_to_github(...)`. Once the step executes successfully, it won't be run again unless its code or dependencies change (i.e. it won't be "dirty").
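To make that shape concrete, here is a minimal, self-contained sketch of an export step's `run` function; `load_dependencies` and `perform_action` are illustrative placeholders, not real ETL helpers:

```python
def load_dependencies() -> dict:
    # Placeholder: a real step would read the upstream (post-Garden) dataset.
    return {"table": "minerals"}


def perform_action(dest_dir: str, data: dict) -> str:
    # Placeholder: a real step would write a TSV, upsert a config, or push to GitHub.
    return f"exported {data['table']} to {dest_dir}"


def run(dest_dir: str) -> None:
    # No dataset is saved; the side effect *is* the output of the step.
    data = load_dependencies()
    print(perform_action(dest_dir, data))
```

Once `run` completes successfully, ETL records the step as clean and skips it on subsequent runs.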

## Creating explorers

TSV files for explorers are created using the `create_explorer` function, usually from a configuration YAML file:

```py
# Create a new explorers dataset and tsv file.
ds_explorer = create_explorer(dest_dir=dest_dir, config=config, df_graphers=df_graphers)
ds_explorer.save()
```

!!! info "Creating explorers on staging servers"

    Explorers can be created or edited on staging servers and then manually migrated to production. Each staging server creates a branch in the `owid-content` repository, and editing explorers in Admin or running the `create_explorer` function pushes changes to that branch. Once the ETL pull request is merged, the branch is pushed to the `owid-content` repository (as its own branch, not `master`); you then need to manually create a PR from that branch and merge it into `master`.


## Creating multi-dimensional indicators

Multi-dimensional indicators are powered by a configuration that is typically created from a YAML file. The structure of the YAML file looks like this:

```yaml title="etl/steps/export/multidim/covid/latest/covid.deaths.yaml"
definitions:
  table: {definitions.table}

title:
  title: COVID-19 deaths
  titleVariant: by interval
defaultSelection:
  - World
  - Europe
  - Asia
topicTags:
  - COVID-19

dimensions:
  - slug: interval
    name: Interval
    choices:
      - slug: weekly
        name: Weekly
        description: null
      - slug: biweekly
        name: Biweekly
        description: null

  - slug: metric
    name: Metric
    choices:
      - slug: absolute
        name: Absolute
        description: null
      - slug: per_capita
        name: Per million people
        description: null
      - slug: change
        name: Change from previous interval
        description: null

views:
  - dimensions:
      interval: weekly
      metric: absolute
    indicators:
      y: "{definitions.table}#weekly_deaths"
  - dimensions:
      interval: weekly
      metric: per_capita
    indicators:
      y: "{definitions.table}#weekly_deaths_per_million"
  - dimensions:
      interval: weekly
      metric: change
    indicators:
      y: "{definitions.table}#weekly_pct_growth_deaths"

  - dimensions:
      interval: biweekly
      metric: absolute
    indicators:
      y: "{definitions.table}#biweekly_deaths"
  - dimensions:
      interval: biweekly
      metric: per_capita
    indicators:
      y: "{definitions.table}#biweekly_deaths_per_million"
  - dimensions:
      interval: biweekly
      metric: change
    indicators:
      y: "{definitions.table}#biweekly_pct_growth_deaths"
```

The `dimensions` field specifies selectors, and the `views` field defines views for the selection. Since there are numerous possible configurations, `views` are usually generated programmatically. However, it's a good idea to create a few of them manually to start.

You can also combine manually defined views with generated ones. See the `etl.multidim` module for available helper functions or refer to examples from `etl/steps/export/multidim/`. Feel free to add or modify the helper functions as needed.
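As an illustration of what "generated programmatically" can mean, the following sketch expands every dimension combination into a view. The `INDICATORS` mapping and `generate_views` helper are hypothetical; the real helpers live in `etl.multidim`:

```python
from itertools import product

# Hypothetical mapping from (interval, metric) to an indicator short name;
# the real names come from the dataset's table.
INDICATORS = {
    ("weekly", "absolute"): "weekly_deaths",
    ("weekly", "per_capita"): "weekly_deaths_per_million",
    ("weekly", "change"): "weekly_pct_growth_deaths",
    ("biweekly", "absolute"): "biweekly_deaths",
    ("biweekly", "per_capita"): "biweekly_deaths_per_million",
    ("biweekly", "change"): "biweekly_pct_growth_deaths",
}


def generate_views(table: str) -> list[dict]:
    """Build one view per (interval, metric) combination."""
    views = []
    for interval, metric in product(["weekly", "biweekly"], ["absolute", "per_capita", "change"]):
        views.append(
            {
                "dimensions": {"interval": interval, "metric": metric},
                "indicators": {"y": f"{table}#{INDICATORS[(interval, metric)]}"},
            }
        )
    return views


views = generate_views("{definitions.table}")
print(len(views))  # 6 combinations: 2 intervals x 3 metrics
```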

The export step loads the YAML file, adds `views` to the config, and then calls `upsert_multidim_data_page`:

```python title="etl/steps/export/multidim/covid/latest/covid.py"
def run(dest_dir: str) -> None:
    engine = get_engine()

    # Load YAML file
    config = paths.load_mdim_config("covid.deaths.yaml")

    multidim.upsert_multidim_data_page("mdd-energy", config, engine)
```

To see the multi-dimensional indicator in Admin, run

```bash
etlr export://multidim/energy/latest/energy --export
```

and check out the preview at http://staging-site-my-branch/admin/grapher/mdd-name.


## Exporting data to GitHub

One common use case for the `export` step is to commit a dataset to a GitHub repository, which is useful when we want to make the data available to the public. The pattern looks like this:

```python
if os.environ.get("CO2_BRANCH"):
    dry_run = False
    branch = os.environ["CO2_BRANCH"]
else:
    dry_run = True
    branch = "master"

gh.commit_file_to_github(
    combined.to_csv(),
    repo_name="co2-data",
    file_path="owid-co2-data.csv",
    commit_message=":bar_chart: Automated update",
    branch=branch,
    dry_run=dry_run,
)
```

This code commits the dataset to the `co2-data` repository on GitHub when the `CO2_BRANCH` environment variable is set, e.g.

```bash
CO2_BRANCH=main etlr export://co2/latest/co2 --export
```
2 changes: 0 additions & 2 deletions docs/guides/data-work/index.md
@@ -3,8 +3,6 @@ tags:
- 👷 Staff
---

Adding and updating datasets in ETL is part of our routine work. To this end, we've simplified the process as much as possible. Below is the list of steps involved in the workflow; click on each step to learn more about it.

```mermaid
Expand Down
17 changes: 11 additions & 6 deletions docs/guides/private-import.md
@@ -3,11 +3,10 @@ tags:
- 👷 Staff
---

While most of the data at OWID is publicly available, some datasets are added to our catalog with some restrictions. These include datasets that are not redistributable, or that are not meant to be shared with the public. This can happen due to a strict license by the data provider, or because the data is still in a draft stage and not ready for public consumption.

Various privacy configurations are available:

- Skip re-publishing to GitHub.
- Disable data downloading options on Grapher.
- Disable public access to the original file (snapshot).
- Hide the dataset from our public catalog (accessible via `owid-catalog-py`).
@@ -16,6 +15,12 @@ In the following, we explain how to create private steps in the ETL pipeline and

## Create a private step


!!! tip "Make your dataset completely private"

    - **Snapshot**: Set `meta.is_public` to `false` in the snapshot DVC file.
    - **Meadow, Garden, Grapher**: Use `data-private://` prefix in the step name in the DAG. Set `dataset.non_redistributable` to `true` in the dataset garden metadata.

### Snapshot

To create a private snapshot step, set the `meta.is_public` property in the snapshot .dvc file to false:
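For example, a snapshot `.dvc` file might look like this (illustrative; the surrounding metadata keys vary per snapshot, and only `is_public` matters here):

```yaml
meta:
  # ... other snapshot metadata (origin, license, etc.) ...
  is_public: false
```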
@@ -34,7 +39,7 @@ This will prevent the file from being publicly accessible without the appropriate credentials.

### Meadow, Garden, Grapher

Creating a private data step means that the data will not be listed in the public catalog, and therefore will not be accessible via `owid-catalog-py`.

To create a private data step (meadow, garden or grapher) simply use `data-private` prefix in the step name in the DAG. For example, the step `grapher/ihme_gbd/2024-06-10/leading_causes_deaths` (this is from [health.yml](https://github.com/owid/etl/blob/master/dag/health.yml)) is private:
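In the DAG this might look like the following (the dependency listed is illustrative, not the step's actual dependency list):

```yaml
steps:
  data-private://grapher/ihme_gbd/2024-06-10/leading_causes_deaths:
    - data-private://garden/ihme_gbd/2024-06-10/leading_causes_deaths
```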

@@ -70,8 +75,8 @@ etl run [step-name] --private

If you want to make a private step public simply follow the steps below:

- **In the DAG:** Replace `data-private://` prefix with `data://`.
- **In the snapshot DVC file**: Set `meta.is_public` to `true` (or simply remove this property).
- (Optional) **Allow for Grapher downloads**: Set `dataset.non_redistributable` to `false` in the dataset garden metadata (or simply remove this property).

After this, re-run the snapshot step and commit your changes.
2 changes: 1 addition & 1 deletion docs/ignore/generate_dynamic_docs.py
@@ -15,7 +15,7 @@

- __[Indicator](#variable)__ (variable)
- __[Origin](#origin)__
- __[Table](#table)__
- __[Dataset](#dataset)__
</div>

13 changes: 13 additions & 0 deletions docs/overrides/main_aux.html
@@ -0,0 +1,13 @@
{% extends "base.html" %}

{% block content %}
{{ super() }}

{% if git_page_authors %}
  <div class="md-source-date">
    <small>
      Authors: {{ git_page_authors | default('enable mkdocs-git-authors-plugin') }}
    </small>
  </div>
{% endif %}
{% endblock %}
41 changes: 26 additions & 15 deletions mkdocs.yml
@@ -93,6 +93,8 @@ extra:
      link: https://ourworldindata.org
    - icon: fontawesome/brands/instagram
      link: https://instagram.com/ourworldindata
    - icon: fontawesome/brands/bluesky
      link: https://bsky.app/profile/ourworldindata.org
    - icon: fontawesome/brands/x-twitter
      link: https://twitter.com/ourworldindata

@@ -149,9 +151,12 @@ plugins:
  - git-authors:
      show_email_address: false
      # authorship_threshold_percent: 1
      show_contribution: true
      # show_line_count: true
      # count_empty_lines: true
      ignore_authors:
        - owidbot
      sort_authors_by: contribution
  - git-revision-date-localized
  - tags:
      tags_file: tags.md
@@ -205,23 +210,29 @@ nav:
  - Contributing: "contributing.md"
  - Guides:
      - "guides/index.md"
      - Adding data:
          - "guides/data-work/index.md"
          - New data: "guides/data-work/add-data.md"
          - Updating data: "guides/data-work/update-data.md"
          - Update charts: "guides/data-work/update-charts.md"
          - Export data: "guides/data-work/export-data.md"
      - Main tools:
          - Wizard: "guides/wizard.md"
          - CLI: "guides/etl-cli.md"
          - Harmonize country names: "guides/harmonize-countries.md"
          - Backport from database: "guides/backport.md"
          - Regular updates: "guides/auto-regular-updates.md"
      - Servers & settings:
          - Environments: "guides/environment.md"
          - Staging servers: "guides/staging-servers.md"
          - Public servers: "guides/sharing-external.md"
          - Private datasets: "guides/private-import.md"
          - OpenAI setup: "guides/openai.md"
      - Others:
          - Edit the documentation: "dev/docs.md"
          - Metadata in data pages: "guides/metadata-play.md"

  - Design principles:
      - Design principles & workflow: architecture/index.md
3 changes: 2 additions & 1 deletion pyproject.toml
@@ -92,10 +92,11 @@ dev-dependencies = [
    "boto3-stubs[s3]>=1.34.154",
    "gspread>=5.12.4",
    "jsonref>=1.1.0",
    "mkdocs-material>=9.5.34",
    "mkdocs-jupyter>=0.24.8",
    "mkdocs-exclude>=1.0.2",
    "mkdocs-gen-files>=0.5.0",
    "mkdocs-git-authors-plugin>=0.9.2",
    "mkdocs-git-revision-date-localized-plugin>=1.2.6",
    "mkdocs-click>=0.8.1",
    "mkdocs-glightbox>=0.3.7",