-
Notifications
You must be signed in to change notification settings - Fork 0
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Bsweger/more state permutations/37 #45
Conversation
This is a potential breaking change to anyone who was relying on full state name in the location field of sequence metadata datasets. The default is now the two letter state abbreviation, which is more commonly used in hub tasks.json files.
Documenting the filter_covid_genome_metadata function made it clear that its name should better align with the CladeTime vernacular ('sequence metadata'). Additionally, it makes more sense to move sequence.py out of the utility folder, since sequence-related functions are integral to the package. util/sequence.py remains in the code base for now, so we can import the filtering function with its old name for backward compatibility.
38a328a
to
520add4
Compare
src/cladetime/_typing.py
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Renamed this file to something more consistent with it's new use (previously, it contained some custom types for mypy, which never actually worked as intended)
src/cladetime/sequence.py
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
From a usability perspective, it makes sense to treat "sequence" as a first-class cladetime citizen, rather than something buried in a util folder, so moved it.
The only differences are to the "filter metadata" function:
- function is renamed from
filter_covid_genome_metadata
tofilter_sequence_metadata
for consistent terminology withCladeTime
object - added parameter for state format
- added docstring
df_assignments = df_assignments.insert_column(1, seq) # type: ignore | ||
|
||
return df_assignments | ||
# For temporary backwards compatibility |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ensure existing util.sequence
imports continue to work
@@ -78,7 +77,7 @@ def test_filter_covid_genome_metadata(): | |||
"Homo sapiens", | |||
], | |||
"country": ["USA", "Argentina", "USA", "USA", "USA", "USA", "USA"], | |||
"division": ["Alaska", "Maine", "Guam", "Puerto Rico", "Utah", "Pennsylvania", "Pennsylvania"], | |||
"division": ["Alaska", "Maine", "Guam", "Puerto Rico", "Utah", "Washington DC", "Pennsylvania"], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The us
package has some special "Washington DC" handling, so make sure we're testing for that
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good overall; I had some questions
src/cladetime/util/sequence.py
Outdated
"""Apply a standard set of filters to the GenBank genome metadata.""" | ||
|
||
# Default columns to include in the filtered metadata | ||
if len(cols) == 0: | ||
if not cols: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not if cols is None:
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not
if cols is None:
?
Laziness that resulted in a frivolous falsy? (you're right, and I pushed a fix)
src/cladetime/util/sequence.py
Outdated
|
||
A helper function to apply commonly-used filters to a Polars LazyFrame |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Q: This will still work if I call it with a pl.DataFrame
instead of a pl.DataFrame
, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, good call--just pushed an update to include DataFrame
in the docstring
This function can also accept and return a polars DataFrame (in addition to a LazyFrame)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
approved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
even more approved
Fixes #37
Background
Add options for state format when returning filtered sequence metadata: FIPS code, full state name (formerly the only option), or two-letter state abbr (the new default).
The primary impact will be to the
unscored-location-dates
files in the variant-nowcast-hub. Here's a sample of the updated file created with this branch (via a local run).This changeset doesn't change the existing
unscored-location-dates
files, which still have full state names (we can add an issue for updating the historical data if needed).Related updates
Notes to reviewer
get_location_date_counts.py
andget_clades_to_model.py
scripts invariant-nowcast-hub
have been tested against this feature branch to that the backwards compatibility is working as expected