Restructure SEC company information tables #4079

Draft: wants to merge 4 commits into base: main

@katie-lamb (Member) commented Feb 23, 2025

Overview

Closes #4078. This is the first of two (or more) SEC table restructuring PRs. It handles the quarterly filings table and the company information tables; it doesn't include the ownership tables.

What problem does this address?

Restructures the SEC tables so that they are better normalized and easier to use.

What did you change?

core_sec10k__quarterly_filings
This table is largely unchanged; only minor changes were made.

  • Ensure that the CIK of each filing matches the CIK in the filename string. Should there be a test for this?
  • Update the filing_date column description to indicate that it has daily frequency
  • Indicate that report_date is the quarter that the filing_date falls in
  • Enforce the format of exhibit_21_version with a regex. Should this be enforced with a constraint in the field metadata?
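
The two checks in this list (CIK vs. filename, and the exhibit_21_version format) could be sketched roughly as below. This is an illustration, not the code in this PR. It assumes filenames follow the EDGAR index layout `edgar/data/<CIK>/<accession>.txt`, and the `exhibit_21_version` pattern is a hypothetical placeholder because the real format isn't stated here. The function names are made up for the example.

```python
import re
from typing import Optional

# Assumption: filenames look like "edgar/data/<CIK>/<accession>.txt".
FILENAME_CIK = re.compile(r"^edgar/data/(\d+)/")
# Hypothetical pattern: "21" optionally followed by ".<digits>", e.g. "21.1".
EXHIBIT_21_VERSION = re.compile(r"21(\.\d+)?")


def cik_from_filename(filename: str) -> Optional[str]:
    """Return the CIK embedded in a filename, zero-padded to 10 digits."""
    match = FILENAME_CIK.match(filename)
    return match.group(1).zfill(10) if match else None


def find_bad_filings(filings: list) -> list:
    """Return filings whose CIK disagrees with the filename, or whose
    exhibit_21_version is present but does not match the pattern."""
    bad = []
    for row in filings:
        expected = str(row["central_index_key"]).zfill(10)
        cik_ok = cik_from_filename(row["filename"]) == expected
        version = row.get("exhibit_21_version")
        # A missing version is allowed; only non-null values are checked.
        version_ok = version is None or EXHIBIT_21_VERSION.fullmatch(version) is not None
        if not (cik_ok and version_ok):
            bad.append(row)
    return bad
```

A check like `find_bad_filings` could back either a unit test or a field-level constraint; which one fits better depends on how the field metadata constraints are defined.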

Company Information Tables

raw_sec10k__quarterly_company_information

  • Rename the existing core_sec10k__quarterly_company_information table to raw_sec10k__quarterly_company_information, since it is really a raw table

core_sec10k__quarterly_company_information

  • Pivot the raw table so that field values are columns
  • Make report_date and central_index_key the primary key
  • Strip leading "[" from field values
  • Add field metadata for new columns
  • Investigate how many CIKs from the quarterly filings table don't have corresponding harvested company information in this table.
    • I see 257 of 39,183 filer CIKs that aren't harvested into this company information table. It looks like ~200 of these are pre-2000 records.
  • Update row counts in the ETL fast and full row counts CSVs
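
As a rough illustration of the pivot, the bracket stripping, and the coverage check listed above, here is a minimal stdlib-only sketch. It is not the PR's implementation. The raw column names `key` and `value` are assumptions about the long-format layout, and both function names are hypothetical.

```python
from collections import defaultdict


def pivot_company_information(raw_rows):
    """Pivot long (key, value) rows into one dict of fields per
    (central_index_key, report_date) primary key."""
    wide = defaultdict(dict)
    for row in raw_rows:
        pk = (row["central_index_key"], row["report_date"])
        value = row["value"]
        if isinstance(value, str):
            # Strip any run of leading "[" characters from the field value.
            value = value.lstrip("[")
        # If a field repeats for the same key, the first value seen is kept.
        wide[pk].setdefault(row["key"], value)
    return dict(wide)


def ciks_missing_company_info(filing_ciks, wide):
    """Return filer CIKs that have no row in the pivoted table."""
    harvested = {cik for cik, _report_date in wide}
    return set(filing_ciks) - harvested
```

The second function is the kind of set difference that would produce a count like the one quoted above, given the filer CIKs from the quarterly filings table.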

Questions:

  • central_index_key and report_date are not the natural primary key, because the headers harvested from different filings differ slightly. I prioritized headers from filings where the filer is the same company as the record being harvested, but forcing the CIK + report date primary key still drops some records. Is there any reason to keep filename as the primary key?
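
To make the trade-off in this question concrete, here is a minimal sketch of one way the prioritization could work. It is an assumption-laden illustration, not the logic in this PR: `filer_cik` is a hypothetical column naming the company that submitted the filing a header was harvested from, and ties between equally ranked records are broken by input order.

```python
def dedupe_on_primary_key(records):
    """Keep one record per (central_index_key, report_date).

    Records harvested from a filing made by the company itself
    (filer_cik == central_index_key) win over records harvested from
    another company's filing. Returns (kept, dropped).
    """
    best = {}
    dropped = []
    for rec in records:
        pk = (rec["central_index_key"], rec["report_date"])
        current = best.get(pk)
        if current is None:
            best[pk] = rec
            continue
        rec_is_self = rec["filer_cik"] == rec["central_index_key"]
        cur_is_self = current["filer_cik"] == current["central_index_key"]
        if rec_is_self and not cur_is_self:
            # A self-filed header replaces one from someone else's filing.
            dropped.append(current)
            best[pk] = rec
        else:
            dropped.append(rec)
    return list(best.values()), dropped
```

The `dropped` list is what would be lost by forcing the CIK + report date key; keeping filename in the key would retain those rows at the cost of multiple rows per company per quarter.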

out_sec10k__quarterly_company_information

  • Merge utility_id_eia and utility_name_eia onto the core company information table.
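
The merge above amounts to a left join that keeps every company row and fills in EIA utility fields where a match exists. A minimal sketch, assuming a hypothetical lookup keyed by central_index_key (how SEC companies are actually matched to EIA utilities is not described here):

```python
def merge_eia_utilities(core_rows, cik_to_utility):
    """Left-join EIA utility ID and name onto company information rows.

    cik_to_utility is a hypothetical mapping from central_index_key to a
    dict with "utility_id_eia" and "utility_name_eia". Companies without
    a match keep their row and get None for both columns.
    """
    merged = []
    for row in core_rows:
        utility = cik_to_utility.get(row["central_index_key"], {})
        merged.append(
            {
                **row,
                "utility_id_eia": utility.get("utility_id_eia"),
                "utility_name_eia": utility.get("utility_name_eia"),
            }
        )
    return merged
```

A left join is used in the sketch so that the output table has the same row count and primary key as the core table.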

Documentation

Make sure to update relevant aspects of the documentation.

Testing

How did you make sure this worked? How can a reviewer verify this?

@katie-lamb katie-lamb added the sec10k Issues related to SEC 10K filing data. label Feb 23, 2025
@katie-lamb katie-lamb self-assigned this Feb 23, 2025
@katie-lamb katie-lamb marked this pull request as draft February 23, 2025 18:39
@katie-lamb katie-lamb changed the title from "Restructure SEC tables" to "Restructure SEC company information tables" Feb 25, 2025
Labels
sec10k Issues related to SEC 10K filing data.
Projects
Status: In review