Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bsweger/more state permutations/37 #45

Merged
merged 8 commits into from
Oct 28, 2024
Merged

Conversation

bsweger
Copy link
Collaborator

@bsweger bsweger commented Oct 28, 2024

Fixes #37

Background

Add options for state format when returning filtered sequence metadata: FIPS code, full state name (formerly the only option), or two-letter state abbr (the new default).

The primary impact will be to the unscored-location-dates files in the variant-nowcast-hub. Here's a sample of the updated file created with this branch (via a local run).

target_date,location,count
2024-09-29,AK,0
2024-09-30,AK,0
2024-10-01,AK,0
2024-10-02,AK,0
2024-10-03,AK,0

This changeset doesn't change the existing unscored-location-dates files, which still have full state names (we can add an issue for updating the historical data if needed).

Related updates

  • Because this change involved updates to the function that filters sequence metadata, the PR also includes an updated docstring and corresponding documentation updates.
  • As part of the documentation updates, there is some refactoring to incorporate some of the newer cladetime naming conventions and to tidy package organization (while keep backwards compatibility until we incorporate these changes into the variant-nowcast-hub/src scripts)

Notes to reviewer

  • This one can be reviewed commit by commit
  • The get_location_date_counts.py and get_clades_to_model.py scripts in variant-nowcast-hub have been tested against this feature branch to that the backwards compatibility is working as expected
  • Preview of updated docs

This is a potential breaking change to anyone who was relying on
full state name in the location field of sequence metadata
datasets. The default is now the two letter state abbreviation,
which is more commonly used in hub tasks.json files.
Documenting the filter_covid_genome_metadata function made it
clear that its name should better align with the CladeTime
vernacular ('sequence metadata'). Additionally, it makes more
sense to move sequence.py out of the utility folder, since
sequence-related functions are integral to the package.

util/sequence.py remains in the code base for now, so we can
import the filtering function with its old name for backward
compatibility.
@bsweger bsweger force-pushed the bsweger/more-state-permutations/37 branch from 38a328a to 520add4 Compare October 28, 2024 14:03
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Renamed this file to something more consistent with it's new use (previously, it contained some custom types for mypy, which never actually worked as intended)

Copy link
Collaborator Author

@bsweger bsweger Oct 28, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From a usability perspective, it makes sense to treat "sequence" as a first-class cladetime citizen, rather than something buried in a util folder, so moved it.

The only differences are to the "filter metadata" function:

  • function is renamed from filter_covid_genome_metadata to filter_sequence_metadata for consistent terminology with CladeTime object
  • added parameter for state format
  • added docstring

df_assignments = df_assignments.insert_column(1, seq) # type: ignore

return df_assignments
# For temporary backwards compatibility
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ensure existing util.sequence imports continue to work

@@ -78,7 +77,7 @@ def test_filter_covid_genome_metadata():
"Homo sapiens",
],
"country": ["USA", "Argentina", "USA", "USA", "USA", "USA", "USA"],
"division": ["Alaska", "Maine", "Guam", "Puerto Rico", "Utah", "Pennsylvania", "Pennsylvania"],
"division": ["Alaska", "Maine", "Guam", "Puerto Rico", "Utah", "Washington DC", "Pennsylvania"],
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The us package has some special "Washington DC" handling, so make sure we're testing for that

@bsweger bsweger requested a review from elray1 October 28, 2024 14:17
Copy link
Collaborator

@elray1 elray1 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good overall; I had some questions

"""Apply a standard set of filters to the GenBank genome metadata."""

# Default columns to include in the filtered metadata
if len(cols) == 0:
if not cols:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not if cols is None:?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not if cols is None:?

Laziness that resulted in a frivolous falsy? (you're right, and I pushed a fix)


A helper function to apply commonly-used filters to a Polars LazyFrame
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Q: This will still work if I call it with a pl.DataFrame instead of a pl.DataFrame, right?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, good call--just pushed an update to include DataFrame in the docstring

src/cladetime/assign_clades.py Outdated Show resolved Hide resolved
This function can also accept and return a polars DataFrame
(in addition to a LazyFrame)
elray1
elray1 previously approved these changes Oct 28, 2024
Copy link
Collaborator

@elray1 elray1 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

approved

Copy link
Collaborator

@elray1 elray1 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

even more approved

@bsweger bsweger merged commit 0290dd9 into main Oct 28, 2024
2 checks passed
@bsweger bsweger deleted the bsweger/more-state-permutations/37 branch October 28, 2024 15:41
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

In unscored-location-dates, use two-letter abbreviations for states to match variant nowcast hub convention
2 participants