
Add NREL Siting Lab dataset archiver #585

Open · wants to merge 13 commits into base: cambium
Conversation

@e-belfer (Member) commented on Feb 12, 2025

Overview

Closes #584

What problem does this address?

  • Add an archiver to grab all NREL Siting Lab data. The links on the main page are dynamically generated through JavaScript, so we POST to the API to get them before looping through each page and archiving the data (see the sketch below). We end up with one zip file per dataset.
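
A simplified, synchronous sketch of that flow. The endpoint path, payload, and response fields here are assumptions for illustration, not the exact OpenEI API contract:

import requests

OPENEI_API_URL = "https://data.openei.org/api"  # assumed endpoint

def list_siting_lab_dataset_links() -> list[str]:
    """POST to the API to enumerate dataset pages, since the links on
    https://data.openei.org/siting_lab are rendered client-side."""
    response = requests.post(
        OPENEI_API_URL,
        data={"action": "getSubmissionStatistics", "format": "json"},  # assumed payload
        timeout=60,
    )
    response.raise_for_status()
    # Assumed response shape: a list of submission records carrying IDs.
    return [
        f"https://data.openei.org/submissions/{record['submissionId']}"
        for record in response.json()["submissions"]
    ]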

What did you change in this PR?

  • Added the archiver and the archiver source metadata

Questions for the reviewer:

  • I've left working_partitions blank in the datapackage.json file because I dynamically generate all the URLs (sketched below). Is this a problem if we aren't planning to access the data through our datastore?
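
For reference, a hypothetical Python mirror of the entry being described; only working_partitions being blank is confirmed by the discussion above, the other field names are assumptions:

# Hypothetical shape of the source metadata entry.
nrelsiting_source = {
    "name": "nrelsiting",
    "path": "https://data.openei.org/siting_lab",
    "working_partitions": {},  # intentionally blank: URLs are generated dynamically
}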

Testing

How did you make sure this worked? How can a reviewer verify this?
https://github.com/catalyst-cooperative/pudl-archiver/actions/runs/13296766499


@e-belfer requested a review from cmgosnell on Feb 12, 2025
@e-belfer self-assigned this on Feb 12, 2025
@cmgosnell (Member) left a comment

The data itself looks great! A few suggestions for clarity, but when I generated the archive locally and spot-checked several of the datasets, all the data looked to be there.

Comment on lines 88 to 99

# A few datasets have an additional linked data page:
# e.g., https://data.openei.org/submissions/1932
additional_datasets_pattern = re.compile(r"\/submissions\/\d{4}")
links = await self.get_hyperlinks(dataset_link, additional_datasets_pattern)

# For each additional dataset linked, iterate through the same process
for link in links:
    additional_dataset_id = link.split("/")[-1]
    additional_data_paths_in_archive = await self.download_nrel_data(
        dataset_id=additional_dataset_id, dataset_link=link
    )
    data_links.update(additional_data_paths_in_archive)
@cmgosnell (Member)

This is a nit, but I think it would be more natural for this to happen in download_nrel_data, where all the other links are being compiled.
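
A sketch of that suggestion, reusing names from the diff above; the return type and the elided download logic are assumptions:

ADDITIONAL_DATASETS_PATTERN = re.compile(r"\/submissions\/\d{4}")

async def download_nrel_data(self, dataset_id: str, dataset_link: str) -> set[str]:
    data_paths_in_archive = set()
    # ... existing logic that downloads this dataset's files and adds
    # their in-archive paths to data_paths_in_archive ...

    # A few datasets have an additional linked data page, e.g.
    # https://data.openei.org/submissions/1932; recurse into each one
    # so that all link compilation happens in this one method.
    links = await self.get_hyperlinks(dataset_link, ADDITIONAL_DATASETS_PATTERN)
    for link in links:
        additional_dataset_id = link.split("/")[-1]
        data_paths_in_archive |= await self.download_nrel_data(
            dataset_id=additional_dataset_id, dataset_link=link
        )
    return data_paths_in_archive

One caveat: if submissions can link back to one another, the recursion would need a visited-ID guard to avoid loops.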

@e-belfer e-belfer changed the base branch from main to cambium February 17, 2025 14:43
@cmgosnell (Member) left a comment

Okay, without looking at a new version of the archive itself, the archiver code looks great! Thanks for the additional clarity in the API data and the comments. Very helpful!


name: str = "nrelsiting"
base_url: str = "https://data.openei.org/siting_lab"
concurrency_limit = 1  # The server can get a bit cranky, so let's be nice.
@cmgosnell (Member)

so nice
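
As an aside on the snippet above: concurrency_limit is presumably consumed by the shared archiver base class. A minimal sketch of the usual pattern, where the base-class names are assumptions for illustration rather than pudl-archiver's actual internals:

import asyncio

class AbstractDatasetArchiver:  # hypothetical base-class shape
    concurrency_limit: int | None = None

    async def download_all(self, urls: list[str]) -> None:
        # Bound in-flight downloads so the OpenEI server isn't overwhelmed.
        semaphore = asyncio.Semaphore(self.concurrency_limit or len(urls))

        async def bounded(url: str) -> None:
            async with semaphore:
                await self.download_file(url)  # assumed download helper

        await asyncio.gather(*(bounded(url) for url in urls))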

Status: In progress

Successfully merging this pull request may close these issues:

  • Write an archiver for NREL Siting Datasets (#584)