
Add NREL Siting Lab dataset archiver #585

Open · wants to merge 13 commits into base: cambium
Conversation

@e-belfer (Member) commented on Feb 12, 2025

Overview

Closes #584

What problem does this address?

  • Add an archiver to grab all NREL Siting Lab data. The links on the main page are dynamically generated through JavaScript, so we POST to the API to get them before looping through each page and archiving the data (see the sketch below). We end up with one zip file per dataset.
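
A simplified, synchronous sketch of that flow. The endpoint path, payload, and response fields here are assumptions for illustration, not the exact OpenEI API contract:

import requests

OPENEI_API_URL = "https://data.openei.org/api"  # assumed endpoint

def list_siting_lab_dataset_links() -> list[str]:
    """POST to the API to enumerate dataset pages, since the links on
    https://data.openei.org/siting_lab are rendered client-side."""
    response = requests.post(
        OPENEI_API_URL,
        data={"action": "getSubmissionStatistics", "format": "json"},  # assumed payload
        timeout=60,
    )
    response.raise_for_status()
    # Assumed response shape: a list of submission records carrying IDs.
    return [
        f"https://data.openei.org/submissions/{record['submissionId']}"
        for record in response.json()["submissions"]
    ]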

What did you change in this PR?

  • Added the archiver and the archiver source metadata

Questions for the reviewer:

  • I've left working_partitions blank in the datapackage.json file because I dynamically generate all the URLs (sketched below). Is this a problem if we aren't planning to access the data through our datastore?
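
For reference, a hypothetical Python mirror of the entry being described; only working_partitions being blank is confirmed by the discussion above, the other field names are assumptions:

# Hypothetical shape of the source metadata entry.
nrelsiting_source = {
    "name": "nrelsiting",
    "path": "https://data.openei.org/siting_lab",
    "working_partitions": {},  # intentionally blank: URLs are generated dynamically
}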

Testing

How did you make sure this worked? How can a reviewer verify this?
https://github.com/catalyst-cooperative/pudl-archiver/actions/runs/13296766499


@e-belfer requested a review from cmgosnell on Feb 12, 2025
@e-belfer self-assigned this on Feb 12, 2025
@cmgosnell (Member) left a comment

The data itself looks great! A few suggestions for clarity, but when I generated the archive locally and spot-checked several of the datasets, all the data looked to be there.

Comment on lines 88 to 99

# A few datasets have an additional linked data page:
# e.g., https://data.openei.org/submissions/1932
additional_datasets_pattern = re.compile(r"\/submissions\/\d{4}")
links = await self.get_hyperlinks(dataset_link, additional_datasets_pattern)

# For each additional dataset linked, iterate through the same process
for link in links:
    additional_dataset_id = link.split("/")[-1]
    additional_data_paths_in_archive = await self.download_nrel_data(
        dataset_id=additional_dataset_id, dataset_link=link
    )
    data_links.update(additional_data_paths_in_archive)
@cmgosnell (Member)

This is a nit, but I think it would be more natural for this to happen in download_nrel_data, where all the other links are being compiled.
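
A sketch of that suggestion, reusing names from the diff above; the return type and the elided download logic are assumptions:

ADDITIONAL_DATASETS_PATTERN = re.compile(r"\/submissions\/\d{4}")

async def download_nrel_data(self, dataset_id: str, dataset_link: str) -> set[str]:
    data_paths_in_archive = set()
    # ... existing logic that downloads this dataset's files and adds
    # their in-archive paths to data_paths_in_archive ...

    # A few datasets have an additional linked data page, e.g.
    # https://data.openei.org/submissions/1932; recurse into each one
    # so that all link compilation happens in this one method.
    links = await self.get_hyperlinks(dataset_link, ADDITIONAL_DATASETS_PATTERN)
    for link in links:
        additional_dataset_id = link.split("/")[-1]
        data_paths_in_archive |= await self.download_nrel_data(
            dataset_id=additional_dataset_id, dataset_link=link
        )
    return data_paths_in_archive

One caveat: if submissions can link back to one another, the recursion would need a visited-ID guard to avoid loops.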

@e-belfer e-belfer changed the base branch from main to cambium February 17, 2025 14:43
@cmgosnell (Member) left a comment

Okay, without looking at a new version of the archive itself, the archiver code looks great! Thanks for the additional clarity in the API data and the comments. Very helpful!


name: str = "nrelsiting"
base_url: str = "https://data.openei.org/siting_lab"
concurrency_limit = 1  # The server can get a bit cranky, so let's be nice.
@cmgosnell (Member)

so nice
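
As an aside on the snippet above: concurrency_limit is presumably consumed by the shared archiver base class. A minimal sketch of the usual pattern, where the base-class names are assumptions for illustration rather than pudl-archiver's actual internals:

import asyncio

class AbstractDatasetArchiver:  # hypothetical base-class shape
    concurrency_limit: int | None = None

    async def download_all(self, urls: list[str]) -> None:
        # Bound in-flight downloads so the OpenEI server isn't overwhelmed.
        semaphore = asyncio.Semaphore(self.concurrency_limit or len(urls))

        async def bounded(url: str) -> None:
            async with semaphore:
                await self.download_file(url)  # assumed download helper

        await asyncio.gather(*(bounded(url) for url in urls))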

Status: In progress

Successfully merging this pull request may close these issues:

  • Write an archiver for NREL Siting Datasets (#584)