Incorrect generator_ids in 2023 data #3987

Open
grgmiller opened this issue Dec 4, 2024 · 5 comments
Labels
bug Things that are just plain broken.

@grgmiller
Collaborator

Describe the bug

In our OGE pipeline, we're getting warnings about missing prime mover codes for certain generators, like plant 57991 generator PV2. However, looking through all of the raw EIA data, I can't find any record of a "PV2" generator at this plant. It appears in out_eia__yearly_generators and core_eia860__scd_generators starting in 2023, but it has mostly missing values and a data_maturity of NA.

It appears this is a pudl bug where some records are getting mixed up.

Another example: a new generator "HB2PV" appears at plant 1. However, in the raw EIA-860 data this generator is actually associated with plant 65851, and I'm not sure how it is getting associated with plant 1.
There also appears to be a generator "61552" associated with plant 61153, which looks suspiciously like a plant code.

Here's a list of all of the generators that have no data_maturity and do not appear to be real generators:

plant_id_eia | generator_id | report_date | data_maturity
-- | -- | -- | --
67744 | RS1 | 2023-01-01 | <NA>
67295 | 3658 | 2023-01-01 | <NA>
66897 | ALBPV | 2023-01-01 | <NA>
66502 | FWSOL | 2023-01-01 | <NA>
66222 | 61194 | 2023-01-01 | <NA>
66147 | 5653 | 2023-01-01 | <NA>
65859 | GEN03 | 2023-01-01 | <NA>
65647 | GEN1 | 2023-01-01 | <NA>
65550 | 65789 | 2023-01-01 | <NA>
65550 | 64921 | 2023-01-01 | <NA>
65550 | 63843 | 2023-01-01 | <NA>
65084 | ELDPV | 2023-01-01 | <NA>
64966 | GEN1 | 2023-01-01 | <NA>
64876 | OHAMP | 2023-01-01 | <NA>
64436 | WLB | 2023-01-01 | <NA>
64182 | PRAPV | 2023-01-01 | <NA>
64094 | PBS0L | 2023-01-01 | <NA>
63541 | 63257 | 2023-01-01 | <NA>
63506 | 63243 | 2023-01-01 | <NA>
63210 | SAINT | 2023-01-01 | <NA>
62975 | SYNLB | 2023-01-01 | <NA>
62760 | SONRI | 2023-01-01 | <NA>
62652 | 63359 | 2023-01-01 | <NA>
62355 | 2WPSO | 2023-01-01 | <NA>
61807 | 66 | 2023-01-01 | <NA>
61752 | 49 | 2023-01-01 | <NA>
61722 | 32 | 2023-01-01 | <NA>
61720 | 30 | 2023-01-01 | <NA>
61716 | 26 | 2023-01-01 | <NA>
61169 | 60798 | 2023-01-01 | <NA>
61153 | 61552 | 2023-01-01 | <NA>
60797 | 61168 | 2023-01-01 | <NA>
60441 | 1 | 2023-01-01 | <NA>
58644 | All | 2023-01-01 | <NA>
57991 | PV2 | 2023-01-01 | <NA>
34516 | SOL1 | 2023-01-01 | <NA>
9170 | 3093 | 2023-01-01 | <NA>
1 | TE1PV | 2023-01-01 | <NA>
1 | RDYPV | 2023-01-01 | <NA>
1 | MIDPV | 2023-01-01 | <NA>
1 | LUNPV | 2023-01-01 | <NA>
1 | HB2PV | 2023-01-01 | <NA>
1 | CFCPV | 2023-01-01 | <NA>

I also tested this using the nightly build version of the pudl database, downloaded 12/4/2024, and also saw this issue.

Bug Severity

How badly is this bug affecting you?

  • Medium: With some effort, I can work around the bug.

To Reproduce

Using the most recent stable version of the pudl database (stable v2024.11.0):

gens = load_data.load_pudl_table("core_eia860__scd_generators", year=None)
gens[gens["data_maturity"].isna()][["plant_id_eia","generator_id","report_date","data_maturity"]]

Expected behavior

I would expect these mismatched generators not to be there.

Software Environment?

Windows, accessed via OGE, using stable v2024.11.0

@aesharpe
Member

aesharpe commented Dec 27, 2024

Hi @grgmiller, TLDR I just looked into this bug, and it's happening during a step in the harvesting process that pulls information from columns that go by names other than the generic plant_id_eia and generator_id. This explains why it wasn't searchable in the raw data.

In the eia.py harvesting module, there is a function called _compile_all_entity_records that concatenates entity records from all tables they appear in. There's a step, if mapped_schemas:, that harvests information from columns that contain harvestable information like plant_id_eia but go by names like plant_id_eia_direct_support_1:

[{'operator_utility_id_eia': 'utility_id_eia'}, 
{'plant_id_eia_direct_support_1': 'plant_id_eia', 'generator_id_direct_support_1': 'generator_id'},
{'plant_id_eia_direct_support_2': 'plant_id_eia', 'generator_id_direct_support_2': 'generator_id'},
{'plant_id_eia_direct_support_3': 'plant_id_eia', 'generator_id_direct_support_3': 'generator_id'}]
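
To make the effect of that step concrete, here is a minimal pandas sketch of the idea. It is not the actual PUDL implementation: the toy `storage` table, its values, and the `storage_mapped_1` label are made up for illustration, and only the column names and the shape of the mapping come from the list above.

```python
import pandas as pd

# Toy stand-in for the energy storage table; all values are illustrative.
storage = pd.DataFrame(
    {
        "plant_id_eia": [100, 200],
        "generator_id": ["BESS1", "BESS2"],
        "plant_id_eia_direct_support_1": [100, 200],
        "generator_id_direct_support_1": ["PV1", "PVX"],
    }
)

# One alias mapping, in the same shape as the mapped schemas listed above.
mapped_schemas = [
    {
        "plant_id_eia_direct_support_1": "plant_id_eia",
        "generator_id_direct_support_1": "generator_id",
    },
]

id_cols = ["plant_id_eia", "generator_id"]

# Start with the IDs reported under their generic column names.
pieces = [storage[id_cols].assign(table="storage")]

# Rename each set of alias columns to the generic names and stack them in.
for i, mapping in enumerate(mapped_schemas, start=1):
    aliased = (
        storage[list(mapping)]
        .rename(columns=mapping)
        .dropna(subset=id_cols)
        .assign(table=f"storage_mapped_{i}")
    )
    pieces.append(aliased)

compiled = pd.concat(pieces, ignore_index=True)

# Every distinct pair becomes an entity, whether or not it exists elsewhere.
harvested_ids = compiled[id_cols].drop_duplicates().reset_index(drop=True)
print(harvested_ids)
```

In this toy example a pair such as (200, PVX) ends up in `harvested_ids` even though it only ever appeared in an alias column, which mirrors how a pair like 57991 / PV2 can appear in the harvested tables without being searchable under the generic column names.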

I stuck a breakpoint in the _compile_all_entity_records function and, upon inspecting the final compiled_df, saw that the plant_id_eia: 57991 and generator_id: PV2 pair came from _core_eia860__generators_energy_storage_mapped_2 (thanks to the table column, which isn't passed into the final entity or scd tables). This confirms the theory that the weird PV2 generator ID comes from the mapped_schemas step.

If you look at the _core_eia860__generators_energy_storage table you'll find the plant id 57991 and generator id PV2 pair show up in the plant_id_eia_direct_support_2 and generator_id_direct_support_2 columns. These columns indicate which plants and generators a given energy storage unit is intended to store. In theory they are existing plant and generator pairs, and are therefore harvested and added to the plant_id_eia and generator_id values that show up in the out_eia__yearly_generators and core_eia860__scd_generators tables you mentioned.

This is also true for:

  • plant_id_eia: 1 and generator_id: HB2PV (from the ...direct_support_1 columns)
  • plant_id_eia: 61153 and generator_id: 61552 (from the ...direct_support_1 columns)
  • ...and presumably the rest of the plant and generator pairs you listed above.

The fact that the plant_id_eia: 57991 and generator_id: PV2 only shows up in the energy storage table direct_support columns is weird though. It's possible it's an error or something. But this is why it's in there!
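
One way to check which pairs exist only in the direct_support columns is an anti-join against the IDs reported under the generic column names. The sketch below assumes you have already built two small frames of ID pairs yourself; the frame names `reported` and `direct_support` are hypothetical, and the rows are toy data (the 57991 / PV2 and HB2PV pairs echo this thread, while the 57991 / PV1 row is a made-up placeholder).

```python
import pandas as pd

id_cols = ["plant_id_eia", "generator_id"]

# Hypothetical: pairs reported under the generic columns of the raw tables.
reported = pd.DataFrame(
    {"plant_id_eia": [57991, 65851], "generator_id": ["PV1", "HB2PV"]}
)

# Hypothetical: pairs gathered from the *_direct_support_N columns.
direct_support = pd.DataFrame(
    {"plant_id_eia": [57991, 1, 57991], "generator_id": ["PV2", "HB2PV", "PV1"]}
)

# A left merge with indicator=True tags each direct-support pair with
# whether it also appears among the reported pairs.
merged = direct_support.merge(
    reported.drop_duplicates(), on=id_cols, how="left", indicator=True
)

# "left_only" rows are pairs that exist nowhere except the direct-support columns.
orphans = merged.loc[merged["_merge"] == "left_only", id_cols].reset_index(drop=True)
print(orphans)
```

Matching on the (plant_id_eia, generator_id) pair rather than on generator_id alone matters here, since an ID like HB2PV can be real at one plant and spurious at another.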

@aesharpe aesharpe moved this from Backlog to In progress in Catalyst Megaproject Dec 27, 2024
@aesharpe aesharpe moved this from In progress to In review in Catalyst Megaproject Dec 27, 2024
@grgmiller
Collaborator Author

Thanks for looking into this @aesharpe! I understand the source of this issue from your description, but I didn't see mention of what the planned next step would be on this issue. Would these mappings remain in or be removed from core_eia860__scd_generators in the future? Is there a good way in the meantime to filter on whether a generator is a "direct support" generator or not?

Currently in OGE, I'm just filtering on whether the data_maturity column is NA, but I'm not sure if that is a stable solution.

@aesharpe
Member

aesharpe commented Jan 13, 2025

Hi @grgmiller, sorry for the delay, I was out last week. I can think of a couple solutions, but I'm going to start by emailing EIA to see if I can better understand these generators and why they aren't showing up anywhere other than the energy storage table.

For now, there is no great way to filter other than what you're currently doing. However, it's not ideal that the data_maturity column is blank for these records as that isn't very helpful/informative for users. I'll relay what I hear from EIA and we can go from there!

Is your specific issue that these records have no prime_mover field?

@aesharpe
Member

@grgmiller EIA got back to me and said the following:

It appears that for most cases, the storage unit is planned and the plant ID exists but the generator ID does not. We will look into this issue during the 2024 data collection and seek to minimize these issues.

For the time being we can discuss other options.

@e-belfer
Member

At minimum, we could follow what I did when I originally harvested these IDs (#3699) and pull any new "fake" plant/generator IDs without any data out of the out and _out tables. They will still be in the core tables because these have a foreign key relationship with the direct support columns. Are you experiencing this problem when you're using the out generator tables, @grgmiller, and would that resolve your problem?
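
As a rough illustration of what pulling ID-only records out of an output table could look like, here is a pandas sketch. It is an assumption about the general approach, not the code from #3699: the toy `out_gens` frame and its values are invented, and the choice of which columns count as "data" is hypothetical.

```python
import pandas as pd

# Columns that identify a record; everything else is treated as data.
id_cols = ["plant_id_eia", "generator_id", "report_date"]

# Toy output table: the second row has IDs but no actual data.
out_gens = pd.DataFrame(
    {
        "plant_id_eia": [57991, 57991],
        "generator_id": ["PV1", "PV2"],
        "report_date": pd.to_datetime(["2023-01-01", "2023-01-01"]),
        "prime_mover_code": ["PV", None],
        "capacity_mw": [10.0, None],
    }
)

# Keep only rows where at least one non-ID column holds a value.
data_cols = out_gens.columns.difference(id_cols)
has_data = out_gens[data_cols].notna().any(axis=1)
cleaned = out_gens[has_data].reset_index(drop=True)
print(cleaned)
```

A filter like this would only apply to the output tables; as noted above, the core tables would still need to carry the IDs to satisfy the foreign key relationship.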

@cmgosnell cmgosnell moved this from In review to New in Catalyst Megaproject Feb 3, 2025
@aesharpe aesharpe moved this from New to Icebox in Catalyst Megaproject Feb 24, 2025